Conversation

eranrs (Contributor) commented Jan 8, 2026

Summary

Adds control plane performance tests with multi-node support.

Changes

  • Add test_control_plane.py with cycle/reconnect tests
  • Uses TCPStore by default (--use-etcd to switch)
  • Skips NIC discovery by default (--discover-nics to enable)
  • Consistent with data plane test flags and defaults

Testing

  • ✅ Single-node: 8 processes

Dependencies

eranrs and others added 30 commits December 30, 2025 14:25
Minimal performance test for NIXL EP Buffer data plane operations:
- dispatch throughput/latency (dispatch only)
- combine throughput/latency (one dispatch, many combines)
- e2e throughput/latency (dispatch + combine cycles)

Usage: python3 test_data_plane.py --num-processes=8 --mode=e2e

Includes:
- mp_runner.py: Multi-process test runner with GPU/UCX coordination
- rank_server.py: TCP-based coordination server for distributed tests
Adds test_control_plane.py to measure latency of:
- Buffer initialization (init)
- connect_ranks()
- disconnect_ranks()
- destroy()
- Full cycle (init → connect → disconnect → reconnect → destroy)

Supports multiple expert counts per run and both IPC and RDMA backends.
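
A rough sketch of what one timed cycle could look like, using the Buffer methods named above; the helper and its arguments (make_buffer, peer_ranks) are hypothetical, not the actual test code:

  import time

  def timed_full_cycle(make_buffer, peer_ranks):
      """Time one init -> connect -> disconnect -> reconnect -> destroy cycle."""
      t0 = time.perf_counter()
      buf = make_buffer()               # init (Buffer construction)
      buf.connect_ranks(peer_ranks)     # connect
      buf.disconnect_ranks(peer_ranks)  # disconnect
      buf.connect_ranks(peer_ranks)     # reconnect
      buf.destroy()                     # destroy
      return time.perf_counter() - t0
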
Adds support for using TCPStore instead of etcd for metadata exchange
during control plane performance tests. Uses port offset (+1000) to
avoid conflicts with torchrun's default port.

Usage: python3 test_control_plane.py --use-tcp-store --nvlink-backend=none
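
A minimal sketch of the port-offset idea; the default port and variable names are assumptions, not the actual test code:

  import os

  # torchrun's rendezvous port; 29500 is its usual default, assumed here
  master_port = int(os.environ.get("MASTER_PORT", "29500"))
  tcp_store_port = master_port + 1000  # offset avoids clashing with torchrun's port
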
When --use-tcp-store is passed, skip the etcd availability check
since metadata exchange is done via TCPStore instead.
The use_tcp_store parameter was captured as a named argument but not
passed through to the child processes, causing them to default to
use_tcp_store=False and hang waiting for etcd metadata.
- Add store_group.py helper from PR 1155
- Start dedicated TCPStore server process in mp_runner
- Use create_client_store() in workers (no world_size needed)
- Cleaner separation: master server + client workers
- Follows ai-dynamo/nixl PR 1155 elastic test pattern
- Add use_tcp_store parameter to setup_worker_environment()
- Only set NIXL_ETCD_ENDPOINTS when NOT using TCPStore
- This prevents C++ code from activating etcd path when TCPStore is requested
- Matches elastic.py implementation from pr-1155 branch
- Fixes control plane tests to work with --use-tcp-store flag
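
Roughly, the conditional described above could look like this inside setup_worker_environment(); the signature shown is an assumption:

  import os

  def setup_worker_environment(use_tcp_store: bool, etcd_endpoints: str) -> None:
      # Setting NIXL_ETCD_ENDPOINTS makes the C++ layer take the etcd path,
      # so only set it when TCPStore was not requested.
      if not use_tcp_store:
          os.environ["NIXL_ETCD_ENDPOINTS"] = etcd_endpoints
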
- Use standard PyTorch distributed env vars (WORLD_SIZE, RANK, MASTER_ADDR)
- CLI flags (--world-size, --rank, --master-addr) override env vars
- WORLD_SIZE = number of nodes (not total ranks)
- num-processes = GPUs per node
- RANK 0 = master node (runs TCPStore and rank server)
- RANK > 0 = worker nodes (connect to master)
- Validation: error if MASTER_ADDR not set for workers
- Calculate global rank = node_rank * procs_per_node + local_rank
- Backward compatible: defaults to single-node (WORLD_SIZE=1, RANK=0)

Example multi-node usage:
  # Master (node 0)
  WORLD_SIZE=2 RANK=0 MASTER_ADDR=node0 python3 test_control_plane.py --num-processes=8 --use-tcp-store

  # Worker (node 1)
  WORLD_SIZE=2 RANK=1 MASTER_ADDR=node0 python3 test_control_plane.py --num-processes=8 --use-tcp-store
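
A sketch of the "env vars with CLI override" pattern from this commit; defaults and flag wiring are illustrative, not the exact script:

  import argparse
  import os

  parser = argparse.ArgumentParser()
  parser.add_argument("--world-size", type=int,
                      default=int(os.environ.get("WORLD_SIZE", "1")))  # number of nodes
  parser.add_argument("--rank", type=int,
                      default=int(os.environ.get("RANK", "0")))        # node index
  parser.add_argument("--master-addr",
                      default=os.environ.get("MASTER_ADDR"))
  args = parser.parse_args()

  if args.rank > 0 and not args.master_addr:
      parser.error("MASTER_ADDR (or --master-addr) is required for worker nodes")
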
- Only master node (RANK=0) checks/cleans etcd state
- Only master node starts TCPStore server
- Only master node starts rank coordination server
- Worker nodes connect to master's services
- Add clear logging for multi-node master/worker role
- Prevents conflicts when multiple nodes try to manage shared services
- Document WORLD_SIZE, RANK, MASTER_ADDR environment variables
- Provide examples for master and worker node setup
- Explain CLI flag alternatives (--world-size, --rank, --master-addr)
- Clarify that WORLD_SIZE = number of nodes, not total ranks
- Add note that master node runs TCPStore/rank server
- Recommend --use-tcp-store to avoid etcd dependency
- RankClient uses get_rank() not register_rank()
- Multi-node: calculate deterministic global_rank = RANK * num_processes + local_rank
- Single-node: use server-assigned global_rank (auto-incrementing)
- Prevents non-deterministic rank assignment in multi-node setup
- Fixes AttributeError: 'RankClient' object has no attribute 'register_rank'
- Spawned worker processes weren't inheriting UCX_TLS from parent shell
- Now explicitly set UCX_TLS=rc_mlx5,dc_mlx5,tcp if not already set
- Critical for cross-node RDMA communication in multi-node tests
- Set UCX_TLS=rc_mlx5,dc_mlx5,tcp,^cuda_ipc before Buffer creation
- Prevents buffer.py from overwriting with only ^cuda_ipc
- Ensures RDMA transports are enabled for cross-node communication
- Simplified rank assignment: always use rank_server.get_rank()
- Works identically for single-node and multi-node
- Follows elastic.py pattern: ranks assigned sequentially (0,1,2,...)
- Each Buffer gets unique global_rank for agent name
- Ensures unique metadata in etcd/TCPStore
- torch_rank used for CUDA device (0-7 per node)
- etcd must be accessible from all nodes via master IP
- Auto-detect when etcd_server is localhost but master_addr is not
- Override to use http://master_addr:2379 for multi-node
- Matches elastic.py pattern: --etcd-server=MASTER_IP:2379
- Critical for cross-node metadata exchange
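
Illustrative only: the localhost-vs-master auto-override described above might look like this (variable names and values are assumed):

  master_addr = "node0"                  # from MASTER_ADDR / --master-addr
  etcd_server = "http://localhost:2379"  # default on a single node

  if "localhost" in etcd_server and master_addr not in ("localhost", "127.0.0.1"):
      # All nodes must reach the same etcd instance, so point at the master.
      etcd_server = f"http://{master_addr}:2379"
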
Buffer.__init__() sets UCX_TLS='^cuda_ipc' when nvlink_backend!='nixl'.
This overwrites RDMA transports needed for multi-node communication.
UCX initialization happens in update_memory_buffers() -> init() -> _nixl_agent_init().
We must restore RDMA transports AFTER Buffer() but BEFORE update_memory_buffers().
UCX_TLS=^cuda_ipc means 'use all transports except cuda_ipc'.
UCX will auto-detect RDMA transports (rc_mlx5, dc_mlx5, tcp).
Don't override what buffer.py sets.
Log UCX_NET_DEVICES and NIXL_ETCD_ENDPOINTS right before Buffer creation
to verify environment is set correctly for multi-node.
This commit migrates elastic.py to TCPStore-based metadata exchange
instead of ETCD.

Note that this TCPStore instance can be re-used for rank management
(instead of the custom TCP based implementation we currently have)
but distributed rank management performs worse at scale (tested
with 64 concurrent get_rank() calls), so the rank server will stay
for now.

Signed-off-by: Itay Alroy <[email protected]>
- Pass tcp_store_group to Buffer() constructor (like elastic.py)
- Remove set_tcp_store_group() calls (no longer needed)
- Cherry-picked 4 commits from PR 1155:
  * Skip invalidate MD with TCPStore
  * Increase wait time for tcp store md exchange
  * Migrate elastic.py to TCPStore
  * Use tcp_store's multi_get to fetch all MDs at once
Copy elastic.py pattern: append IPoIB (InfiniBand) TCP interfaces
to UCX_NET_DEVICES for cross-node communication fallback.

Without TCP NICs, UCX can't establish active messages transport
between nodes, causing 'no active messages transport' errors.
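
A sketch of the fallback described above, appending IPoIB interfaces so UCX's tcp transport has a usable device; the device and interface names are placeholders:

  import os

  rdma_devices = os.environ.get("UCX_NET_DEVICES", "mlx5_0:1")  # per-GPU RDMA device
  ipoib_ifaces = ["ib0"]  # IPoIB interfaces usable by UCX's tcp transport (placeholder)

  os.environ["UCX_NET_DEVICES"] = ",".join([rdma_devices, *ipoib_ifaces])
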
- Each node now checks for its own expected global rank range
  (Master: 0-7, Worker: 8-15, etc.)
- Previous bug: all nodes checked for ranks 0-num_processes, causing the
  worker to think the master's ranks had died
- Add debug logging to show discovered GPU-NIC topology per node
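
The fixed liveness check, sketched: each node watches only its own slice of global ranks (names and values are illustrative):

  node_rank, num_processes = 1, 8   # e.g. the worker node in a 2x8 setup
  expected_ranks = range(node_rank * num_processes,
                         (node_rank + 1) * num_processes)   # worker: 8-15
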
Make rank assignment consistent with elastic.py:
- local_rank: GPU index on this node (0-7) from rank server
  → Used for GPU/NIC selection (CUDA_VISIBLE_DEVICES, UCX_NET_DEVICES)
- global_rank: Unique rank across all nodes (0-15 for 2 nodes) from rank server
  → Used for Buffer(rank=...)
- torch_rank: Spawn index (internal to torch.multiprocessing, not used)

Previous bug: Used torch_rank for GPU/NIC selection instead of local_rank,
causing GPU 0 to get GPU 4's NIC, etc.

This matches the simple, clear pattern in elastic.py.
Show both global rank and local_rank in debug output to verify
rank server assignment is working correctly.
eranrs added 20 commits January 5, 2026 13:44
…rocess index

Previous approach used connection-order rank assignment from rank server,
which caused unpredictable rank distribution across nodes. Now:
- global_rank = node_rank * num_processes + local_rank
- Node 0 gets ranks 0-7, Node 1 gets ranks 8-15, etc.
- Removes dependency on rank server for rank assignment (still used for barriers)
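
Putting the rank notions together, a sketch of the deterministic assignment described above (the Buffer constructor call is shown only as a comment since its full argument list is not spelled out here):

  import os

  node_rank = int(os.environ.get("RANK", "0"))   # node index, 0 = master
  num_processes = 8                              # GPUs per node (--num-processes)
  local_rank = 3                                 # spawn/GPU index on this node

  global_rank = node_rank * num_processes + local_rank   # node 1, GPU 3 -> rank 11

  os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)   # GPU/NIC selection uses local_rank
  # buffer = Buffer(rank=global_rank, ...)                # unique identity uses global_rank
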
Workers now log:
- ✓ TCPStore ready at <master>:<port>
- ✓ Master is alive! Connected to rank server at <master>:<port>
All log messages from run_multiprocess_test now start with [Node X]
where X is the node's rank (0=master, 1-N=workers)
…ages

All server readiness polling messages now include the node prefix for
consistent multi-node debugging output
- Pass node_rank from mp_runner to test function via kwargs
- Configure logger formatter in test function with [Node X] prefix
- All output including results tables now prefixed for multi-node debugging
Set logger formatter early in main() so all output including
the test configuration header and results tables have the prefix
- Use logger formatter instead of manual prefix in mp_runner.py
- Remove log_prefix parameter from wait_for_tcp_port and wait_for_server
- All logs now get prefix from formatter automatically
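
A minimal sketch of the formatter-based prefixing (the exact format string is an assumption):

  import logging

  def configure_node_logging(node_rank: int) -> None:
      handler = logging.StreamHandler()
      handler.setFormatter(logging.Formatter(f"[Node {node_rank}] %(message)s"))
      root = logging.getLogger()
      root.handlers[:] = [handler]   # replace any manual-prefix handlers
      root.setLevel(logging.INFO)
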
- Add --world-size, --rank, --master-addr parameters
- Add --use-tcp-store for TCPStore metadata exchange
- Add [Node X] prefix to all log messages
- Support TCPStore in _run_data_plane_test function
- Consistent with test_control_plane.py interface
The \n at the start of the log message caused the prefix to appear
after the newline. Use a separate logger.info('') call for the blank line instead.
Was using 'from store_group import store_group' which failed.
Changed to 'import store_group' to match test_control_plane.py
- Remove double torch.cuda.synchronize() around connect_ranks
- Clean all etcd keys instead of just nixl/* prefix
- Simplify etcd_server logic to always use master_addr
- Remove manual wait_for_tcp_port, use TCPStore built-in timeout
- Invert NIC discovery: default=skip, add --discover-nics flag
- Invert metadata exchange: default=TCPStore, add --use-etcd flag
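
The inverted defaults as argparse flags, sketched (help text and destinations assumed):

  import argparse

  parser = argparse.ArgumentParser()
  parser.add_argument("--discover-nics", action="store_true",
                      help="run NIC discovery (skipped by default)")
  parser.add_argument("--use-etcd", action="store_true",
                      help="use etcd for metadata exchange (TCPStore is the default)")
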
- Add wait_for_tcp_port for both master and worker nodes
- Prevents race condition where workers try to connect before server is ready
- Fixes timeout errors in single-node TCPStore tests
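
A plausible shape for the wait_for_tcp_port helper mentioned above; the real helper in mp_runner.py may differ:

  import socket
  import time

  def wait_for_tcp_port(host: str, port: int, timeout: float = 60.0) -> bool:
      """Poll until a TCP server accepts connections, or give up after timeout."""
      deadline = time.monotonic() + timeout
      while time.monotonic() < deadline:
          try:
              with socket.create_connection((host, port), timeout=1.0):
                  return True
          except OSError:
              time.sleep(0.5)
      return False
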
- Store returned object in variable to prevent Python GC
- Without this, TCPStore server would immediately shut down
- Fixes 'TCP port not ready' timeout error
- Add noqa comment for intentionally unused _store variable
- Variable must be kept alive to prevent garbage collection
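
The keep-alive fix, sketched with torch's TCPStore (host and port values are illustrative):

  from torch.distributed import TCPStore

  # Keep a reference; if the returned store is not bound to a variable,
  # Python may garbage-collect it and the server shuts down immediately.
  _store = TCPStore("127.0.0.1", 30500, is_master=True, wait_for_workers=False)  # noqa: F841
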
Focus PR scope on test/python/nixl_ep_perf only
- Add spacing around arithmetic operator (world_size - 1)
- Fix import order to be alphabetical (nixl_ep after numpy/torch)
- Update copyright year to 2025-2026
- Update copyright headers for files modified in 2026
- Fix import order: nixl_ep before numpy/torch (CI requirement)
- Add test_control_plane.py for control plane performance testing
- Use TCPStore by default instead of etcd (--use-etcd to opt-in)
- Skip NIC discovery by default (--discover-nics to opt-in)
- Aligned with test_data_plane.py flag conventions
- Copyright 2025-2026
eranrs requested a review from a team as a code owner on January 8, 2026 07:49

copy-pr-bot bot commented Jan 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


github-actions bot commented Jan 8, 2026

👋 Hi eranrs! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

eranrs force-pushed the adding_tests_control_plane branch from a1b1b0b to 3e59d8a on January 14, 2026 08:06
eranrs force-pushed the adding_tests_control_plane branch from 3e59d8a to d1564ba on January 14, 2026 08:08