Contributor

@eranrs eranrs commented Dec 11, 2025

Test categories:

  • Functional tests: connection management, dispatch, combine, masking, end-to-end
  • Performance tests: control plane latency, data plane throughput
  • Bug reproductions: documented known issues (BUG-01 through BUG-05)
  • Unit tests: individual component validation

Infrastructure:

  • Multi-process test runner with GPU-per-rank assignment
  • TCP-based rank server for process coordination
  • Results collector for CI/CD integration
  • Pytest marks for selective test execution

Documentation:

  • README files for each test category
  • Known issues and workarounds documented
  • Multi-node testing considerations in .cursorrules

What?

This PR adds a comprehensive pytest-based test suite for the nixl_ep example (Expert-Parallel communication buffer). The test suite includes:

  • Functional tests: connection management, dispatch, combine, masking, and end-to-end workflows
  • Performance tests: control plane latency and data plane throughput benchmarks
  • Bug reproductions: documented known issues (BUG-01 through BUG-05) for tracking and regression testing
  • Unit tests: individual component validation

Supporting infrastructure:

  • Multi-process test runner with GPU-per-rank assignment
  • TCP-based rank server for process coordination
  • Results collector for CI/CD integration
  • Pytest marks for selective test execution

Why?

The nixl_ep example currently lacks automated tests, making it difficult to:

  • Verify correctness after code changes
  • Track performance regressions
  • Document and reproduce known issues
  • Validate multi-GPU configurations

This test suite enables CI/CD integration and provides a foundation for quality assurance as the codebase evolves.

How?

Tests are located in test/python/nixl_ep_tests/ with the following structure:

  • functional/ - Multi-process tests using mp.spawn with 8 GPUs
  • perf/ - Benchmarks for control/data plane operations
  • bugs/ - Reproduction cases for known issues
  • unit/ - Single-process component tests
  • utils/ - Shared test infrastructure (mp_runner, rank_server, etc.)

Run with: python -m pytest test/python/nixl_ep_tests/ -v
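
The multi-process pattern behind the functional tests can be sketched as follows. This is a minimal stand-in for mp_runner using Python's stdlib multiprocessing with the fork start method for brevity (the real runner uses mp.spawn and does full GPU/UCX coordination); all function and variable names here are illustrative, not the PR's actual API:

```python
import multiprocessing as mp
import os

def _worker(local_rank, results):
    # Hypothetical per-rank GPU pinning; the PR's mp_runner does the real
    # GPU/UCX coordination and runs the actual test body per rank.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)
    results.put((local_rank, os.environ["CUDA_VISIBLE_DEVICES"]))

def run_multiprocess_test(num_procs=4):
    ctx = mp.get_context("fork")  # sketch only; the real runner uses mp.spawn
    results = ctx.Queue()
    procs = [ctx.Process(target=_worker, args=(r, results)) for r in range(num_procs)]
    for p in procs:
        p.start()
    # Drain the queue before joining to avoid blocking on a full pipe.
    out = sorted(results.get() for _ in range(num_procs))
    for p in procs:
        p.join()
    return out
```

Each worker sees only its assigned GPU, which is the GPU-per-rank assignment the infrastructure bullet above refers to.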

@eranrs eranrs requested a review from a team as a code owner December 11, 2025 12:44
copy-pr-bot bot commented Dec 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi eranrs! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@eranrs eranrs force-pushed the adding_tests branch 2 times, most recently from bfa6295 to 50969d3 on December 11, 2025 at 14:26
@eranrs eranrs force-pushed the adding_tests branch 6 times, most recently from 579bb40 to 69799fa on December 30, 2025 at 12:19
Minimal performance test for NIXL EP Buffer data plane operations:
- dispatch throughput/latency (dispatch only)
- combine throughput/latency (one dispatch, many combines)
- e2e throughput/latency (dispatch + combine cycles)

Usage: python3 test_data_plane.py --num-processes=8 --mode=e2e

Includes:
- mp_runner.py: Multi-process test runner with GPU/UCX coordination
- rank_server.py: TCP-based coordination server for distributed tests

Adds test_control_plane.py to measure latency of:
- Buffer initialization (init)
- connect_ranks()
- disconnect_ranks()
- destroy()
- Full cycle (init → connect → disconnect → reconnect → destroy)

Supports multiple expert counts per run and both IPC and RDMA backends.

Adds support for using TCPStore instead of etcd for metadata exchange
during control plane performance tests. Uses a port offset (+1000) to
avoid conflicts with torchrun's default port.

Usage: python3 test_control_plane.py --use-tcp-store --nvlink-backend=none

When --use-tcp-store is passed, skip the etcd availability check,
since metadata exchange is done via TCPStore instead.

The use_tcp_store parameter was captured as a named argument but not
passed through to the child processes, causing them to default to
use_tcp_store=False and hang waiting for etcd metadata.

- Add store_group.py helper from PR 1155
- Start a dedicated TCPStore server process in mp_runner
- Use create_client_store() in workers (no world_size needed)
- Cleaner separation: master server + client workers
- Follows the ai-dynamo/nixl PR 1155 elastic test pattern

- Add use_tcp_store parameter to setup_worker_environment()
- Only set NIXL_ETCD_ENDPOINTS when NOT using TCPStore
- This prevents the C++ code from activating the etcd path when TCPStore is requested
- Matches the elastic.py implementation from the pr-1155 branch
- Fixes control plane tests to work with the --use-tcp-store flag
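
A control-plane latency measurement of this kind can be sketched with a generic timing helper; time_op is a hypothetical name, and in the real test it would wrap calls such as connect_ranks() or disconnect_ranks() on the Buffer rather than the stand-in shown in the usage note:

```python
import time

def time_op(fn, *args, repeats=10):
    """Return min/median/max per-call latency in milliseconds for an op."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)  # e.g. buf.connect_ranks(peer_ranks) in the real test
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return {"min": samples[0], "median": samples[len(samples) // 2], "max": samples[-1]}
```

For example, time_op(lambda: sum(range(1000)), repeats=5) returns a dict of latency statistics for the wrapped callable.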
@eranrs eranrs requested a review from a team as a code owner January 4, 2026 09:05
eranrs added 7 commits January 4, 2026 11:19
- Use standard PyTorch distributed env vars (WORLD_SIZE, RANK, MASTER_ADDR)
- CLI flags (--world-size, --rank, --master-addr) override env vars
- WORLD_SIZE = number of nodes (not total ranks)
- num-processes = GPUs per node
- RANK 0 = master node (runs TCPStore and rank server)
- RANK > 0 = worker nodes (connect to master)
- Validation: error if MASTER_ADDR not set for workers
- Calculate global rank = node_rank * procs_per_node + local_rank
- Backward compatible: defaults to single-node (WORLD_SIZE=1, RANK=0)

Example multi-node usage:
  # Master (node 0)
  WORLD_SIZE=2 RANK=0 MASTER_ADDR=node0 python3 test_control_plane.py --num-processes=8 --use-tcp-store

  # Worker (node 1)
  WORLD_SIZE=2 RANK=1 MASTER_ADDR=node0 python3 test_control_plane.py --num-processes=8 --use-tcp-store
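
The global-rank convention above can be sketched as a small helper; compute_global_rank is a hypothetical name, and the env-var semantics follow the commit message (WORLD_SIZE = number of nodes, RANK = node rank, defaulting to single-node):

```python
import os

def compute_global_rank(local_rank, procs_per_node):
    # WORLD_SIZE counts nodes (not total ranks) in this runner's convention.
    node_rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    # Worker nodes must know where the master's services live.
    if node_rank > 0 and not os.environ.get("MASTER_ADDR"):
        raise ValueError("MASTER_ADDR must be set on worker nodes")
    if not (0 <= node_rank < world_size):
        raise ValueError(f"RANK {node_rank} out of range for WORLD_SIZE {world_size}")
    return node_rank * procs_per_node + local_rank
```

Under the two-node example above, local rank 3 on node 1 maps to global rank 1 * 8 + 3 = 11.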

- Only the master node (RANK=0) checks/cleans etcd state
- Only the master node starts the TCPStore server
- Only the master node starts the rank coordination server
- Worker nodes connect to the master's services
- Add clear logging for multi-node master/worker roles
- Prevents conflicts when multiple nodes try to manage shared services

- Document the WORLD_SIZE, RANK, MASTER_ADDR environment variables
- Provide examples for master and worker node setup
- Explain CLI flag alternatives (--world-size, --rank, --master-addr)
- Clarify that WORLD_SIZE = number of nodes, not total ranks
- Add a note that the master node runs the TCPStore/rank server
- Recommend --use-tcp-store to avoid the etcd dependency

- RankClient uses get_rank(), not register_rank()
- Multi-node: calculate deterministic global_rank = RANK * num_processes + local_rank
- Single-node: use server-assigned global_rank (auto-incrementing)
- Prevents non-deterministic rank assignment in multi-node setups
- Fixes AttributeError: 'RankClient' object has no attribute 'register_rank'

- Spawned worker processes weren't inheriting UCX_TLS from the parent shell
- Now explicitly set UCX_TLS=rc_mlx5,dc_mlx5,tcp if not already set
- Critical for cross-node RDMA communication in multi-node tests

- Set UCX_TLS=rc_mlx5,dc_mlx5,tcp,^cuda_ipc before Buffer creation
- Prevents buffer.py from overwriting it with only ^cuda_ipc
- Ensures RDMA transports are enabled for cross-node communication
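
The UCX_TLS default described in these commits can be sketched as follows; ensure_ucx_tls is a hypothetical helper, the transport list is copied from the commit messages, and the exact value should be adjusted for your fabric:

```python
import os

def ensure_ucx_tls(cross_node=True):
    # Spawned workers don't reliably inherit shell exports, so set an
    # explicit default. Transport list mirrors the commit message.
    default = "rc_mlx5,dc_mlx5,tcp"
    if cross_node:
        default += ",^cuda_ipc"  # keep CUDA IPC disabled across nodes
    # setdefault = "if not already set"; an existing value wins.
    os.environ.setdefault("UCX_TLS", default)
    return os.environ["UCX_TLS"]
```

Calling this before Buffer creation ensures the RDMA transports are in place before any library code reads (or overwrites) the variable.
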
eranrs added 11 commits January 5, 2026 13:38

Workers calling clear_barriers() after connecting were clearing barriers
that the master's processes were already waiting on, causing barrier timeouts.

…rocess index

The previous approach used connection-order rank assignment from the rank server,
which caused unpredictable rank distribution across nodes. Now:
- global_rank = node_rank * num_processes + local_rank
- Node 0 gets ranks 0-7, Node 1 gets ranks 8-15, etc.
- Removes the dependency on the rank server for rank assignment (still used for barriers)

Workers now log:
- ✓ TCPStore ready at <master>:<port>
- ✓ Master is alive! Connected to rank server at <master>:<port>

All log messages from run_multiprocess_test now start with [Node X],
where X is the node's rank (0=master, 1-N=workers).

…ages

All server readiness polling messages now include the node prefix for
consistent multi-node debugging output.

- Pass node_rank from mp_runner to the test function via kwargs
- Configure the logger formatter in the test function with the [Node X] prefix
- All output, including results tables, is now prefixed for multi-node debugging

Set the logger formatter early in main() so all output, including
the test configuration header and results tables, has the prefix.

- Use a logger formatter instead of a manual prefix in mp_runner.py
- Remove the log_prefix parameter from wait_for_tcp_port and wait_for_server
- All logs now get the prefix from the formatter automatically

- Add --world-size, --rank, --master-addr parameters
- Add --use-tcp-store for TCPStore metadata exchange
- Add the [Node X] prefix to all log messages
- Support TCPStore in the _run_data_plane_test function
- Consistent with the test_control_plane.py interface

The \n at the start of the log message caused the prefix to appear
after the newline. Now using a separate logger.info('') for the blank line.

Was using 'from store_group import store_group', which failed.
Changed to 'import store_group' to match test_control_plane.py.
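
The formatter-based prefix described in these commits can be sketched like this; configure_node_logger and the logger name are illustrative, not the PR's actual identifiers:

```python
import logging

def configure_node_logger(node_rank):
    # Attach the "[Node X]" prefix via a Formatter so every record gets it,
    # instead of prepending it manually in each log call.
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(f"[Node {node_rank}] %(message)s"))
    logger = logging.getLogger("nixl_ep_tests")  # name is illustrative
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Note the pitfall the later commit fixes: a message beginning with \n puts the newline after the formatter's prefix, so a blank line should be emitted as its own logger.info('') call.
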
@eranrs eranrs force-pushed the adding_tests branch 4 times, most recently from 4607e2a to 4c5e2bb on January 5, 2026 at 16:18
Contributor

@itayalroy itayalroy left a comment

How do the dispatch/combine test results compare to what we see with elastic.py? Can you share some results?

eranrs added 4 commits January 7, 2026 13:22
- Remove the double torch.cuda.synchronize() around connect_ranks
- Clean all etcd keys instead of just the nixl/* prefix
- Simplify the etcd_server logic to always use master_addr
- Remove the manual wait_for_tcp_port; use TCPStore's built-in timeout
- Invert NIC discovery: default=skip, add --discover-nics flag
- Invert metadata exchange: default=TCPStore, add --use-etcd flag

- Add wait_for_tcp_port for both master and worker nodes
- Prevents a race condition where workers try to connect before the server is ready
- Fixes timeout errors in single-node TCPStore tests

- Store the returned object in a variable to prevent Python GC
- Without this, the TCPStore server would immediately shut down
- Fixes the 'TCP port not ready' timeout error

- Add a noqa comment for the intentionally unused _store variable
- The variable must be kept alive to prevent garbage collection
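
The garbage-collection pitfall behind this fix can be illustrated with plain Python objects; FakeStore is a hypothetical stand-in for the TCPStore server handle, and the actual fix is simply keeping the returned store in a variable such as _store:

```python
import gc
import weakref

class FakeStore:
    """Stand-in for a TCPStore server handle that shuts down on GC."""
    def __init__(self):
        self.alive = True
    def __del__(self):
        self.alive = False  # the real server socket would close here

def start_server(keep_ref):
    store = FakeStore()
    ref = weakref.ref(store)
    if keep_ref:
        return store, ref  # caller holds the handle: server stays up
    return None, ref       # handle dropped: server is collected

_store, ref = start_server(keep_ref=True)   # noqa: F841  (kept alive on purpose)
_none, dead_ref = start_server(keep_ref=False)
gc.collect()
```

After this runs, ref() still resolves (the handle is held in _store) while dead_ref() returns None: dropping the return value is exactly what made the real TCPStore server shut down immediately.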
@eranrs eranrs changed the title Add comprehensive test suite for nixl_ep example Add data plane tests for nixl_ep Jan 7, 2026
eranrs added 3 commits January 7, 2026 19:55
Focus PR scope on test/python/nixl_ep_perf only

- Add spacing around the arithmetic operator (world_size - 1)
- Fix import order to be alphabetical (nixl_ep after numpy/torch)
- Update copyright year to 2025-2026

- Update copyright headers for files modified in 2026
- Fix import order: nixl_ep before numpy/torch (CI requirement)
Contributor Author

eranrs commented Jan 7, 2026

/build

Contributor

@itayalroy itayalroy left a comment

Given the size of this framework, my review focused on the high-level design. Overall it looks good.

# Add TCP fallback interfaces (like elastic.py) for cross-node communication
# These are IPoIB (InfiniBand) interfaces used as TCP fallback
tcp_nics = (
",ibp26s0,ibp44s0,ibp64s0,ibp101s0,ibp156s0,ibp173s0,ibp192s0,ibp227s0"
Contributor

this looks setup specific
