Add data plane tests for nixl_ep #1109
Conversation
👋 Hi eranrs! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
Minimal performance test for NIXL EP Buffer data plane operations:
- dispatch throughput/latency (dispatch only)
- combine throughput/latency (one dispatch, many combines)
- e2e throughput/latency (dispatch + combine cycles)

Usage: python3 test_data_plane.py --num-processes=8 --mode=e2e

Includes:
- mp_runner.py: Multi-process test runner with GPU/UCX coordination
- rank_server.py: TCP-based coordination server for distributed tests
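A minimal sketch of the e2e timing pattern such a test implies; Buffer, dispatch(), and combine() are stand-ins here and may not match the real nixl_ep API:

    import time

    def time_e2e_cycle(buffer, tokens, iters=100, warmup=10):
        # Warm up so one-time allocation/connection cost is excluded.
        for _ in range(warmup):
            buffer.dispatch(tokens)  # hypothetical nixl_ep call
            buffer.combine(tokens)   # hypothetical nixl_ep call
        # For GPU work, synchronize the device before reading the clock.
        start = time.perf_counter()
        for _ in range(iters):
            buffer.dispatch(tokens)
            buffer.combine(tokens)
        return (time.perf_counter() - start) / iters  # mean latency per cycle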
Adds test_control_plane.py to measure latency of:
- Buffer initialization (init)
- connect_ranks()
- disconnect_ranks()
- destroy()
- Full cycle (init → connect → disconnect → reconnect → destroy)

Supports multiple expert counts per run and both IPC and RDMA backends.
Adds support for using TCPStore instead of etcd for metadata exchange during control plane performance tests. Uses a port offset (+1000) to avoid conflicts with torchrun's default port.

Usage: python3 test_control_plane.py --use-tcp-store --nvlink-backend=none
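A sketch of how the offset might be applied, assuming torch.distributed.TCPStore and torchrun's default MASTER_PORT of 29500; variable names are illustrative:

    import os
    from datetime import timedelta
    from torch.distributed import TCPStore

    base_port = int(os.environ.get("MASTER_PORT", "29500"))
    store_port = base_port + 1000  # keep clear of torchrun's own port

    store = TCPStore(
        os.environ.get("MASTER_ADDR", "localhost"),
        store_port,
        is_master=(int(os.environ.get("RANK", "0")) == 0),
        timeout=timedelta(seconds=60),
    )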
When --use-tcp-store is passed, skip the etcd availability check since metadata exchange is done via TCPStore instead.
The use_tcp_store parameter was captured as a named argument but not passed through to the child processes, causing them to default to use_tcp_store=False and hang waiting for etcd metadata.
- Add store_group.py helper from PR 1155
- Start dedicated TCPStore server process in mp_runner
- Use create_client_store() in workers (no world_size needed)
- Cleaner separation: master server + client workers
- Follows ai-dynamo/nixl PR 1155 elastic test pattern
- Add use_tcp_store parameter to setup_worker_environment()
- Only set NIXL_ETCD_ENDPOINTS when NOT using TCPStore
- This prevents the C++ code from activating the etcd path when TCPStore is requested
- Matches the elastic.py implementation from the pr-1155 branch
- Fixes control plane tests to work with the --use-tcp-store flag
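A sketch of the guard described above; the function signature and the etcd endpoint value are illustrative, not the PR's exact code:

    import os

    def setup_worker_environment(use_tcp_store: bool) -> None:
        if use_tcp_store:
            # Leave NIXL_ETCD_ENDPOINTS unset so the C++ layer
            # does not activate its etcd path.
            os.environ.pop("NIXL_ETCD_ENDPOINTS", None)
        else:
            os.environ["NIXL_ETCD_ENDPOINTS"] = "http://localhost:2379"  # illustrative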
- Use standard PyTorch distributed env vars (WORLD_SIZE, RANK, MASTER_ADDR)
- CLI flags (--world-size, --rank, --master-addr) override env vars
- WORLD_SIZE = number of nodes (not total ranks); num-processes = GPUs per node
- RANK 0 = master node (runs TCPStore and rank server); RANK > 0 = worker nodes (connect to master)
- Validation: error if MASTER_ADDR not set for workers
- Calculate global rank = node_rank * procs_per_node + local_rank
- Backward compatible: defaults to single-node (WORLD_SIZE=1, RANK=0)

Example multi-node usage:

# Master (node 0)
WORLD_SIZE=2 RANK=0 MASTER_ADDR=node0 python3 test_control_plane.py --num-processes=8 --use-tcp-store

# Worker (node 1)
WORLD_SIZE=2 RANK=1 MASTER_ADDR=node0 python3 test_control_plane.py --num-processes=8 --use-tcp-store
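A sketch of the env-var handling and rank math this commit describes; the argparse wiring is illustrative:

    import argparse
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument("--world-size", type=int,
                        default=int(os.environ.get("WORLD_SIZE", "1")))
    parser.add_argument("--rank", type=int,
                        default=int(os.environ.get("RANK", "0")))
    parser.add_argument("--master-addr",
                        default=os.environ.get("MASTER_ADDR"))
    parser.add_argument("--num-processes", type=int, default=8)
    args = parser.parse_args()

    if args.rank > 0 and not args.master_addr:
        raise SystemExit("MASTER_ADDR (or --master-addr) is required on worker nodes")

    # Deterministic global ranks: node 0 owns 0..N-1, node 1 owns N..2N-1, ...
    global_ranks = [args.rank * args.num_processes + local
                    for local in range(args.num_processes)]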
- Only the master node (RANK=0) checks/cleans etcd state
- Only the master node starts the TCPStore server
- Only the master node starts the rank coordination server
- Worker nodes connect to the master's services
- Add clear logging for the multi-node master/worker role
- Prevents conflicts when multiple nodes try to manage shared services
- Document the WORLD_SIZE, RANK, MASTER_ADDR environment variables
- Provide examples for master and worker node setup
- Explain the CLI flag alternatives (--world-size, --rank, --master-addr)
- Clarify that WORLD_SIZE = number of nodes, not total ranks
- Add a note that the master node runs the TCPStore/rank server
- Recommend --use-tcp-store to avoid the etcd dependency
- RankClient uses get_rank(), not register_rank()
- Multi-node: calculate deterministic global_rank = RANK * num_processes + local_rank
- Single-node: use the server-assigned global_rank (auto-incrementing)
- Prevents non-deterministic rank assignment in multi-node setups
- Fixes AttributeError: 'RankClient' object has no attribute 'register_rank'
- Spawned worker processes weren't inheriting UCX_TLS from the parent shell
- Now explicitly set UCX_TLS=rc_mlx5,dc_mlx5,tcp if not already set
- Critical for cross-node RDMA communication in multi-node tests
- Set UCX_TLS=rc_mlx5,dc_mlx5,tcp,^cuda_ipc before Buffer creation
- Prevents buffer.py from overwriting the value with only ^cuda_ipc
- Ensures RDMA transports are enabled for cross-node communication
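A minimal sketch of the ordering constraint; the transport list is taken from the commit message:

    import os

    # Must run before the nixl_ep Buffer is constructed; otherwise
    # buffer.py replaces the value with only "^cuda_ipc" and
    # cross-node RDMA transports are lost.
    os.environ["UCX_TLS"] = "rc_mlx5,dc_mlx5,tcp,^cuda_ipc"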
Workers were calling clear_barriers() after connecting, which cleared barriers that the master's processes were already waiting on and caused barrier timeouts.
…rocess index

The previous approach used connection-order rank assignment from the rank server, which caused unpredictable rank distribution across nodes. Now:
- global_rank = node_rank * num_processes + local_rank
- Node 0 gets ranks 0-7, Node 1 gets ranks 8-15, etc.
- Removes the dependency on the rank server for rank assignment (it is still used for barriers)
Workers now log:
- ✓ TCPStore ready at <master>:<port>
- ✓ Master is alive! Connected to rank server at <master>:<port>
All log messages from run_multiprocess_test now start with [Node X] where X is the node's rank (0=master, 1-N=workers)
…ages

All server readiness polling messages now include the node prefix for consistent multi-node debugging output.
- Pass node_rank from mp_runner to the test function via kwargs
- Configure the logger formatter in the test function with the [Node X] prefix
- All output, including results tables, is now prefixed for multi-node debugging
Set the logger formatter early in main() so that all output, including the test configuration header and results tables, has the prefix.
- Use a logger formatter instead of a manual prefix in mp_runner.py
- Remove the log_prefix parameter from wait_for_tcp_port and wait_for_server
- All logs now get the prefix from the formatter automatically
- Add --world-size, --rank, --master-addr parameters
- Add --use-tcp-store for TCPStore metadata exchange
- Add the [Node X] prefix to all log messages
- Support TCPStore in the _run_data_plane_test function
- Consistent with the test_control_plane.py interface
The \n at the start of the log message caused the prefix to appear after the newline. Use a separate logger.info('') call for the blank line instead.
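A sketch of the formatter-based prefixing and the blank-line workaround; the logger name is illustrative:

    import logging

    node_rank = 0  # in the PR this value comes from mp_runner via kwargs

    logger = logging.getLogger("nixl_ep_perf")  # illustrative name
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(f"[Node {node_rank}] %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("")          # blank separator line, still prefixed correctly
    logger.info("Results:")  # avoid logger.info("\nResults:"), which splits the prefix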
Was using 'from store_group import store_group', which failed. Changed to 'import store_group' to match test_control_plane.py.
itayalroy left a comment:
How do the dispatch/combine test results compare to what we see with elastic.py? Can you share some results?
- Remove the double torch.cuda.synchronize() around connect_ranks
- Clean all etcd keys instead of just the nixl/* prefix
- Simplify etcd_server logic to always use master_addr
- Remove the manual wait_for_tcp_port; use TCPStore's built-in timeout
- Invert NIC discovery: default = skip, add a --discover-nics flag
- Invert metadata exchange: default = TCPStore, add a --use-etcd flag
- Add wait_for_tcp_port for both master and worker nodes
- Prevents a race condition where workers try to connect before the server is ready
- Fixes timeout errors in single-node TCPStore tests
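A sketch of such a poll loop; this is a generic pattern, not necessarily the PR's exact implementation:

    import socket
    import time

    def wait_for_tcp_port(host: str, port: int, timeout: float = 30.0) -> None:
        """Block until host:port accepts TCP connections, or raise TimeoutError."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                with socket.create_connection((host, port), timeout=1.0):
                    return
            except OSError:
                time.sleep(0.5)
        raise TimeoutError(f"{host}:{port} not ready after {timeout:.0f}s")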
- Store the returned object in a variable to prevent Python GC
- Without this, the TCPStore server would immediately shut down
- Fixes the 'TCP port not ready' timeout error
- Add a noqa comment for the intentionally unused _store variable
- The variable must be kept alive to prevent garbage collection
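A sketch of the keep-alive pattern; the host and port values are illustrative:

    from datetime import timedelta
    from torch.distributed import TCPStore

    # Bind the server store to a name: an unreferenced TCPStore is
    # garbage-collected and its listening port closes immediately.
    _store = TCPStore("0.0.0.0", 30500, is_master=True,
                      timeout=timedelta(seconds=60))  # noqa: F841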
Focus PR scope on test/python/nixl_ep_perf only
- Add spacing around the arithmetic operator (world_size - 1)
- Fix import order to be alphabetical (nixl_ep after numpy/torch)
- Update copyright year to 2025-2026
- Update copyright headers for files modified in 2026
- Fix import order: nixl_ep before numpy/torch (CI requirement)
/build
itayalroy left a comment:
Given the size of this framework, my review focused on the high-level design. Overall it looks good.
    # Add TCP fallback interfaces (like elastic.py) for cross-node communication
    # These are IPoIB (InfiniBand) interfaces used as TCP fallback
    tcp_nics = (
        ",ibp26s0,ibp44s0,ibp64s0,ibp101s0,ibp156s0,ibp173s0,ibp192s0,ibp227s0"
This looks setup-specific.
What?
This PR adds a comprehensive pytest-based test suite for the nixl_ep example (Expert-Parallel communication buffer). The test suite includes:
Functional tests: Connection management, dispatch, combine, masking, and end-to-end workflows
Performance tests: Control plane latency and data plane throughput benchmarks
Bug reproductions: Documented known issues (BUG-01 through BUG-05) for tracking and regression testing
Unit tests: Individual component validation
Supporting infrastructure:
Multi-process test runner with GPU-per-rank assignment
TCP-based rank server for process coordination
Results collector for CI/CD integration
Pytest marks for selective test execution
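A sketch of marker-based selection for the last item above; the marker names here are illustrative, not necessarily the ones the suite registers:

    # pytest.ini (or [tool.pytest.ini_options] in pyproject.toml):
    # [pytest]
    # markers =
    #     functional: multi-process functional tests
    #     perf: control/data plane benchmarks
    #     bug: known-issue reproductions

    import pytest

    @pytest.mark.perf
    def test_dispatch_throughput():
        ...

    # Select at the command line:
    #   python -m pytest test/python/nixl_ep_tests/ -m perf -v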
Why?
The nixl_ep example currently lacks automated tests, making it difficult to:
Verify correctness after code changes
Track performance regressions
Document and reproduce known issues
Validate multi-GPU configurations
This test suite enables CI/CD integration and provides a foundation for quality assurance as the codebase evolves.
How?
Tests are located in test/python/nixl_ep_tests/ with the following structure:
functional/ - Multi-process tests using mp.spawn with 8 GPUs
perf/ - Benchmarks for control/data plane operations
bugs/ - Reproduction cases for known issues
unit/ - Single-process component tests
utils/ - Shared test infrastructure (mp_runner, rank_server, etc.)
Run with: python -m pytest test/python/nixl_ep_tests/ -v