Contributor

@eranrs eranrs commented Dec 11, 2025

Test categories:

  • Functional tests: connection management, dispatch, combine, masking, end-to-end
  • Performance tests: control plane latency, data plane throughput
  • Bug reproductions: documented known issues (BUG-01 through BUG-05)
  • Unit tests: individual component validation

Infrastructure:

  • Multi-process test runner with GPU-per-rank assignment
  • TCP-based rank server for process coordination
  • Results collector for CI/CD integration
  • Pytest marks for selective test execution

Documentation:

  • README files for each test category
  • Known issues and workarounds documented
  • Multi-node testing considerations in .cursorrules

What?

This PR adds a comprehensive pytest-based test suite for the nixl_ep example (Expert-Parallel communication buffer). The test suite includes:

  • Functional tests: connection management, dispatch, combine, masking, and end-to-end workflows
  • Performance tests: control plane latency and data plane throughput benchmarks
  • Bug reproductions: documented known issues (BUG-01 through BUG-05) for tracking and regression testing
  • Unit tests: individual component validation

Supporting infrastructure:

  • Multi-process test runner with GPU-per-rank assignment
  • TCP-based rank server for process coordination
  • Results collector for CI/CD integration
  • Pytest marks for selective test execution

Why?

The nixl_ep example currently lacks automated tests, making it difficult to:

  • Verify correctness after code changes
  • Track performance regressions
  • Document and reproduce known issues
  • Validate multi-GPU configurations

This test suite enables CI/CD integration and provides a foundation for quality assurance as the codebase evolves.

How?

Tests are located in test/python/nixl_ep_tests/ with the following structure:

  • functional/ - Multi-process tests using mp.spawn with 8 GPUs
  • perf/ - Benchmarks for control/data plane operations
  • bugs/ - Reproduction cases for known issues
  • unit/ - Single-process component tests
  • utils/ - Shared test infrastructure (mp_runner, rank_server, etc.)

Run with: python -m pytest test/python/nixl_ep_tests/ -v
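
The multi-process pattern behind the functional tests can be sketched as follows. This is a minimal stand-in for mp_runner using Python's stdlib multiprocessing with the fork start method for brevity (the real runner uses mp.spawn and does full GPU/UCX coordination); all function and variable names here are illustrative, not the PR's actual API:

```python
import multiprocessing as mp
import os

def _worker(local_rank, results):
    # Hypothetical per-rank GPU pinning; the PR's mp_runner does the real
    # GPU/UCX coordination and runs the actual test body per rank.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)
    results.put((local_rank, os.environ["CUDA_VISIBLE_DEVICES"]))

def run_multiprocess_test(num_procs=4):
    ctx = mp.get_context("fork")  # sketch only; the real runner uses mp.spawn
    results = ctx.Queue()
    procs = [ctx.Process(target=_worker, args=(r, results)) for r in range(num_procs)]
    for p in procs:
        p.start()
    # Drain the queue before joining to avoid blocking on a full pipe.
    out = sorted(results.get() for _ in range(num_procs))
    for p in procs:
        p.join()
    return out
```

Each worker sees only its assigned GPU, which is the GPU-per-rank assignment the infrastructure bullet above refers to.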

@eranrs eranrs requested a review from a team as a code owner December 11, 2025 12:44
copy-pr-bot bot commented Dec 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi eranrs! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@eranrs eranrs force-pushed the adding_tests branch 2 times, most recently from bfa6295 to 50969d3 on December 11, 2025 at 14:26
@eranrs eranrs force-pushed the adding_tests branch 6 times, most recently from 579bb40 to 69799fa on December 30, 2025 at 12:19
Minimal performance test for NIXL EP Buffer data plane operations:
- dispatch throughput/latency (dispatch only)
- combine throughput/latency (one dispatch, many combines)
- e2e throughput/latency (dispatch + combine cycles)

Usage: python3 test_data_plane.py --num-processes=8 --mode=e2e

Includes:
- mp_runner.py: Multi-process test runner with GPU/UCX coordination
- rank_server.py: TCP-based coordination server for distributed tests

Adds test_control_plane.py to measure latency of:
- Buffer initialization (init)
- connect_ranks()
- disconnect_ranks()
- destroy()
- Full cycle (init → connect → disconnect → reconnect → destroy)

Supports multiple expert counts per run and both IPC and RDMA backends.

Adds support for using TCPStore instead of etcd for metadata exchange
during control plane performance tests. Uses a port offset (+1000) to
avoid conflicts with torchrun's default port.

Usage: python3 test_control_plane.py --use-tcp-store --nvlink-backend=none

When --use-tcp-store is passed, skip the etcd availability check,
since metadata exchange is done via TCPStore instead.

The use_tcp_store parameter was captured as a named argument but not
passed through to the child processes, causing them to default to
use_tcp_store=False and hang waiting for etcd metadata.

- Add store_group.py helper from PR 1155
- Start a dedicated TCPStore server process in mp_runner
- Use create_client_store() in workers (no world_size needed)
- Cleaner separation: master server + client workers
- Follows the ai-dynamo/nixl PR 1155 elastic test pattern

- Add use_tcp_store parameter to setup_worker_environment()
- Only set NIXL_ETCD_ENDPOINTS when NOT using TCPStore
- This prevents the C++ code from activating the etcd path when TCPStore is requested
- Matches the elastic.py implementation from the pr-1155 branch
- Fixes control plane tests to work with the --use-tcp-store flag
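
A control-plane latency measurement of this kind can be sketched with a generic timing helper; time_op is a hypothetical name, and in the real test it would wrap calls such as connect_ranks() or disconnect_ranks() on the Buffer rather than the stand-in shown in the usage note:

```python
import time

def time_op(fn, *args, repeats=10):
    """Return min/median/max per-call latency in milliseconds for an op."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)  # e.g. buf.connect_ranks(peer_ranks) in the real test
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return {"min": samples[0], "median": samples[len(samples) // 2], "max": samples[-1]}
```

For example, time_op(lambda: sum(range(1000)), repeats=5) returns a dict of latency statistics for the wrapped callable.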
@eranrs eranrs requested a review from a team as a code owner January 4, 2026 09:05
eranrs added 7 commits January 4, 2026 11:19
- Use standard PyTorch distributed env vars (WORLD_SIZE, RANK, MASTER_ADDR)
- CLI flags (--world-size, --rank, --master-addr) override env vars
- WORLD_SIZE = number of nodes (not total ranks)
- num-processes = GPUs per node
- RANK 0 = master node (runs TCPStore and rank server)
- RANK > 0 = worker nodes (connect to master)
- Validation: error if MASTER_ADDR not set for workers
- Calculate global rank = node_rank * procs_per_node + local_rank
- Backward compatible: defaults to single-node (WORLD_SIZE=1, RANK=0)

Example multi-node usage:
  # Master (node 0)
  WORLD_SIZE=2 RANK=0 MASTER_ADDR=node0 python3 test_control_plane.py --num-processes=8 --use-tcp-store

  # Worker (node 1)
  WORLD_SIZE=2 RANK=1 MASTER_ADDR=node0 python3 test_control_plane.py --num-processes=8 --use-tcp-store
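
The global-rank convention above can be sketched as a small helper; compute_global_rank is a hypothetical name, and the env-var semantics follow the commit message (WORLD_SIZE = number of nodes, RANK = node rank, defaulting to single-node):

```python
import os

def compute_global_rank(local_rank, procs_per_node):
    # WORLD_SIZE counts nodes (not total ranks) in this runner's convention.
    node_rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    # Worker nodes must know where the master's services live.
    if node_rank > 0 and not os.environ.get("MASTER_ADDR"):
        raise ValueError("MASTER_ADDR must be set on worker nodes")
    if not (0 <= node_rank < world_size):
        raise ValueError(f"RANK {node_rank} out of range for WORLD_SIZE {world_size}")
    return node_rank * procs_per_node + local_rank
```

Under the two-node example above, local rank 3 on node 1 maps to global rank 1 * 8 + 3 = 11.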

- Only the master node (RANK=0) checks/cleans etcd state
- Only the master node starts the TCPStore server
- Only the master node starts the rank coordination server
- Worker nodes connect to the master's services
- Add clear logging for multi-node master/worker roles
- Prevents conflicts when multiple nodes try to manage shared services

- Document the WORLD_SIZE, RANK, MASTER_ADDR environment variables
- Provide examples for master and worker node setup
- Explain CLI flag alternatives (--world-size, --rank, --master-addr)
- Clarify that WORLD_SIZE = number of nodes, not total ranks
- Add a note that the master node runs the TCPStore/rank server
- Recommend --use-tcp-store to avoid the etcd dependency

- RankClient uses get_rank(), not register_rank()
- Multi-node: calculate deterministic global_rank = RANK * num_processes + local_rank
- Single-node: use server-assigned global_rank (auto-incrementing)
- Prevents non-deterministic rank assignment in multi-node setups
- Fixes AttributeError: 'RankClient' object has no attribute 'register_rank'

- Spawned worker processes weren't inheriting UCX_TLS from the parent shell
- Now explicitly set UCX_TLS=rc_mlx5,dc_mlx5,tcp if not already set
- Critical for cross-node RDMA communication in multi-node tests

- Set UCX_TLS=rc_mlx5,dc_mlx5,tcp,^cuda_ipc before Buffer creation
- Prevents buffer.py from overwriting it with only ^cuda_ipc
- Ensures RDMA transports are enabled for cross-node communication
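
The UCX_TLS default described in these commits can be sketched as follows; ensure_ucx_tls is a hypothetical helper, the transport list is copied from the commit messages, and the exact value should be adjusted for your fabric:

```python
import os

def ensure_ucx_tls(cross_node=True):
    # Spawned workers don't reliably inherit shell exports, so set an
    # explicit default. Transport list mirrors the commit message.
    default = "rc_mlx5,dc_mlx5,tcp"
    if cross_node:
        default += ",^cuda_ipc"  # keep CUDA IPC disabled across nodes
    # setdefault = "if not already set"; an existing value wins.
    os.environ.setdefault("UCX_TLS", default)
    return os.environ["UCX_TLS"]
```

Calling this before Buffer creation ensures the RDMA transports are in place before any library code reads (or overwrites) the variable.
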
eranrs added 11 commits January 5, 2026 13:38

Workers calling clear_barriers() after connecting were clearing barriers
that the master's processes were already waiting on, causing barrier timeouts.

…rocess index

The previous approach used connection-order rank assignment from the rank server,
which caused unpredictable rank distribution across nodes. Now:
- global_rank = node_rank * num_processes + local_rank
- Node 0 gets ranks 0-7, Node 1 gets ranks 8-15, etc.
- Removes the dependency on the rank server for rank assignment (still used for barriers)

Workers now log:
- ✓ TCPStore ready at <master>:<port>
- ✓ Master is alive! Connected to rank server at <master>:<port>

All log messages from run_multiprocess_test now start with [Node X],
where X is the node's rank (0=master, 1-N=workers).

…ages

All server readiness polling messages now include the node prefix for
consistent multi-node debugging output.

- Pass node_rank from mp_runner to the test function via kwargs
- Configure the logger formatter in the test function with the [Node X] prefix
- All output, including results tables, is now prefixed for multi-node debugging

Set the logger formatter early in main() so all output, including
the test configuration header and results tables, has the prefix.

- Use a logger formatter instead of a manual prefix in mp_runner.py
- Remove the log_prefix parameter from wait_for_tcp_port and wait_for_server
- All logs now get the prefix from the formatter automatically

- Add --world-size, --rank, --master-addr parameters
- Add --use-tcp-store for TCPStore metadata exchange
- Add the [Node X] prefix to all log messages
- Support TCPStore in the _run_data_plane_test function
- Consistent with the test_control_plane.py interface

The \n at the start of the log message caused the prefix to appear
after the newline. Now using a separate logger.info('') for the blank line.

Was using 'from store_group import store_group', which failed.
Changed to 'import store_group' to match test_control_plane.py.
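
The formatter-based prefix described in these commits can be sketched like this; configure_node_logger and the logger name are illustrative, not the PR's actual identifiers:

```python
import logging

def configure_node_logger(node_rank):
    # Attach the "[Node X]" prefix via a Formatter so every record gets it,
    # instead of prepending it manually in each log call.
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(f"[Node {node_rank}] %(message)s"))
    logger = logging.getLogger("nixl_ep_tests")  # name is illustrative
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Note the pitfall the later commit fixes: a message beginning with \n puts the newline after the formatter's prefix, so a blank line should be emitted as its own logger.info('') call.
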
@eranrs eranrs force-pushed the adding_tests branch 4 times, most recently from 4607e2a to 4c5e2bb on January 5, 2026 at 16:18
Contributor

@itayalroy itayalroy left a comment

How do the dispatch/combine test results compare to what we see with elastic.py? Can you share some results?

eranrs added 4 commits January 7, 2026 13:22
- Remove the double torch.cuda.synchronize() around connect_ranks
- Clean all etcd keys instead of just the nixl/* prefix
- Simplify the etcd_server logic to always use master_addr
- Remove the manual wait_for_tcp_port; use TCPStore's built-in timeout
- Invert NIC discovery: default=skip, add --discover-nics flag
- Invert metadata exchange: default=TCPStore, add --use-etcd flag

- Add wait_for_tcp_port for both master and worker nodes
- Prevents a race condition where workers try to connect before the server is ready
- Fixes timeout errors in single-node TCPStore tests

- Store the returned object in a variable to prevent Python GC
- Without this, the TCPStore server would immediately shut down
- Fixes the 'TCP port not ready' timeout error

- Add a noqa comment for the intentionally unused _store variable
- The variable must be kept alive to prevent garbage collection
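
The garbage-collection pitfall behind this fix can be illustrated with plain Python objects; FakeStore is a hypothetical stand-in for the TCPStore server handle, and the actual fix is simply keeping the returned store in a variable such as _store:

```python
import gc
import weakref

class FakeStore:
    """Stand-in for a TCPStore server handle that shuts down on GC."""
    def __init__(self):
        self.alive = True
    def __del__(self):
        self.alive = False  # the real server socket would close here

def start_server(keep_ref):
    store = FakeStore()
    ref = weakref.ref(store)
    if keep_ref:
        return store, ref  # caller holds the handle: server stays up
    return None, ref       # handle dropped: server is collected

_store, ref = start_server(keep_ref=True)   # noqa: F841  (kept alive on purpose)
_none, dead_ref = start_server(keep_ref=False)
gc.collect()
```

After this runs, ref() still resolves (the handle is held in _store) while dead_ref() returns None: dropping the return value is exactly what made the real TCPStore server shut down immediately.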
@eranrs eranrs changed the title Add comprehensive test suite for nixl_ep example Add data plane tests for nixl_ep Jan 7, 2026
eranrs added 3 commits January 7, 2026 19:55
Focus PR scope on test/python/nixl_ep_perf only

- Add spacing around the arithmetic operator (world_size - 1)
- Fix import order to be alphabetical (nixl_ep after numpy/torch)
- Update copyright year to 2025-2026

- Update copyright headers for files modified in 2026
- Fix import order: nixl_ep before numpy/torch (CI requirement)
Contributor Author

eranrs commented Jan 7, 2026

/build

Contributor

@itayalroy itayalroy left a comment

Given the size of this framework, my review focused on the high-level design. Overall it looks good.

# Add TCP fallback interfaces (like elastic.py) for cross-node communication
# These are IPoIB (InfiniBand) interfaces used as TCP fallback
tcp_nics = (
",ibp26s0,ibp44s0,ibp64s0,ibp101s0,ibp156s0,ibp173s0,ibp192s0,ibp227s0"
Contributor

this looks setup specific
