Conversation

eranrs (Contributor) commented Jan 8, 2026

Summary

Adds control plane performance tests with multi-node support.

Changes

  • Add test_control_plane.py with cycle/reconnect tests
  • Uses TCPStore by default (--use-etcd to switch)
  • Skips NIC discovery by default (--discover-nics to enable)
  • Consistent with data plane test flags and defaults

Testing

  • ✅ Single-node: 8 processes

Dependencies

eranrs and others added 30 commits December 30, 2025 14:25
Minimal performance test for NIXL EP Buffer data plane operations:
- dispatch throughput/latency (dispatch only)
- combine throughput/latency (one dispatch, many combines)
- e2e throughput/latency (dispatch + combine cycles)

Usage: python3 test_data_plane.py --num-processes=8 --mode=e2e

Includes:
- mp_runner.py: Multi-process test runner with GPU/UCX coordination
- rank_server.py: TCP-based coordination server for distributed tests
Adds test_control_plane.py to measure latency of:
- Buffer initialization (init)
- connect_ranks()
- disconnect_ranks()
- destroy()
- Full cycle (init → connect → disconnect → reconnect → destroy)

Supports multiple expert counts per run and both IPC and RDMA backends.
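
A rough sketch of what one timed cycle could look like, using the Buffer methods named above; the helper and its arguments (make_buffer, peer_ranks) are hypothetical, not the actual test code:

  import time

  def timed_full_cycle(make_buffer, peer_ranks):
      """Time one init -> connect -> disconnect -> reconnect -> destroy cycle."""
      t0 = time.perf_counter()
      buf = make_buffer()               # init (Buffer construction)
      buf.connect_ranks(peer_ranks)     # connect
      buf.disconnect_ranks(peer_ranks)  # disconnect
      buf.connect_ranks(peer_ranks)     # reconnect
      buf.destroy()                     # destroy
      return time.perf_counter() - t0
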
Adds support for using TCPStore instead of etcd for metadata exchange
during control plane performance tests. Uses port offset (+1000) to
avoid conflicts with torchrun's default port.

Usage: python3 test_control_plane.py --use-tcp-store --nvlink-backend=none
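
A minimal sketch of the port-offset idea; the default port and variable names are assumptions, not the actual test code:

  import os

  # torchrun's rendezvous port; 29500 is its usual default, assumed here
  master_port = int(os.environ.get("MASTER_PORT", "29500"))
  tcp_store_port = master_port + 1000  # offset avoids clashing with torchrun's port
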
When --use-tcp-store is passed, skip the etcd availability check
since metadata exchange is done via TCPStore instead.
The use_tcp_store parameter was captured as a named argument but not
passed through to the child processes, causing them to default to
use_tcp_store=False and hang waiting for etcd metadata.
- Add store_group.py helper from PR 1155
- Start dedicated TCPStore server process in mp_runner
- Use create_client_store() in workers (no world_size needed)
- Cleaner separation: master server + client workers
- Follows ai-dynamo/nixl PR 1155 elastic test pattern
- Add use_tcp_store parameter to setup_worker_environment()
- Only set NIXL_ETCD_ENDPOINTS when NOT using TCPStore
- This prevents C++ code from activating etcd path when TCPStore is requested
- Matches elastic.py implementation from pr-1155 branch
- Fixes control plane tests to work with --use-tcp-store flag
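
Roughly, the conditional described above could look like this inside setup_worker_environment(); the signature shown is an assumption:

  import os

  def setup_worker_environment(use_tcp_store: bool, etcd_endpoints: str) -> None:
      # Setting NIXL_ETCD_ENDPOINTS makes the C++ layer take the etcd path,
      # so only set it when TCPStore was not requested.
      if not use_tcp_store:
          os.environ["NIXL_ETCD_ENDPOINTS"] = etcd_endpoints
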
- Use standard PyTorch distributed env vars (WORLD_SIZE, RANK, MASTER_ADDR)
- CLI flags (--world-size, --rank, --master-addr) override env vars
- WORLD_SIZE = number of nodes (not total ranks)
- num-processes = GPUs per node
- RANK 0 = master node (runs TCPStore and rank server)
- RANK > 0 = worker nodes (connect to master)
- Validation: error if MASTER_ADDR not set for workers
- Calculate global rank = node_rank * procs_per_node + local_rank
- Backward compatible: defaults to single-node (WORLD_SIZE=1, RANK=0)

Example multi-node usage:
  # Master (node 0)
  WORLD_SIZE=2 RANK=0 MASTER_ADDR=node0 python3 test_control_plane.py --num-processes=8 --use-tcp-store

  # Worker (node 1)
  WORLD_SIZE=2 RANK=1 MASTER_ADDR=node0 python3 test_control_plane.py --num-processes=8 --use-tcp-store
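
A sketch of the "env vars with CLI override" pattern from this commit; defaults and flag wiring are illustrative, not the exact script:

  import argparse
  import os

  parser = argparse.ArgumentParser()
  parser.add_argument("--world-size", type=int,
                      default=int(os.environ.get("WORLD_SIZE", "1")))  # number of nodes
  parser.add_argument("--rank", type=int,
                      default=int(os.environ.get("RANK", "0")))        # node index
  parser.add_argument("--master-addr",
                      default=os.environ.get("MASTER_ADDR"))
  args = parser.parse_args()

  if args.rank > 0 and not args.master_addr:
      parser.error("MASTER_ADDR (or --master-addr) is required for worker nodes")
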
- Only master node (RANK=0) checks/cleans etcd state
- Only master node starts TCPStore server
- Only master node starts rank coordination server
- Worker nodes connect to master's services
- Add clear logging for multi-node master/worker role
- Prevents conflicts when multiple nodes try to manage shared services
- Document WORLD_SIZE, RANK, MASTER_ADDR environment variables
- Provide examples for master and worker node setup
- Explain CLI flag alternatives (--world-size, --rank, --master-addr)
- Clarify that WORLD_SIZE = number of nodes, not total ranks
- Add note that master node runs TCPStore/rank server
- Recommend --use-tcp-store to avoid etcd dependency
- RankClient uses get_rank() not register_rank()
- Multi-node: calculate deterministic global_rank = RANK * num_processes + local_rank
- Single-node: use server-assigned global_rank (auto-incrementing)
- Prevents non-deterministic rank assignment in multi-node setup
- Fixes AttributeError: 'RankClient' object has no attribute 'register_rank'
- Spawned worker processes weren't inheriting UCX_TLS from parent shell
- Now explicitly set UCX_TLS=rc_mlx5,dc_mlx5,tcp if not already set
- Critical for cross-node RDMA communication in multi-node tests
- Set UCX_TLS=rc_mlx5,dc_mlx5,tcp,^cuda_ipc before Buffer creation
- Prevents buffer.py from overwriting with only ^cuda_ipc
- Ensures RDMA transports are enabled for cross-node communication
- Simplified rank assignment: always use rank_server.get_rank()
- Works identically for single-node and multi-node
- Follows elastic.py pattern: ranks assigned sequentially (0,1,2,...)
- Each Buffer gets unique global_rank for agent name
- Ensures unique metadata in etcd/TCPStore
- torch_rank used for CUDA device (0-7 per node)
- etcd must be accessible from all nodes via master IP
- Auto-detect when etcd_server is localhost but master_addr is not
- Override to use http://master_addr:2379 for multi-node
- Matches elastic.py pattern: --etcd-server=MASTER_IP:2379
- Critical for cross-node metadata exchange
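
Illustrative only: the localhost-vs-master auto-override described above might look like this (variable names and values are assumed):

  master_addr = "node0"                  # from MASTER_ADDR / --master-addr
  etcd_server = "http://localhost:2379"  # default on a single node

  if "localhost" in etcd_server and master_addr not in ("localhost", "127.0.0.1"):
      # All nodes must reach the same etcd instance, so point at the master.
      etcd_server = f"http://{master_addr}:2379"
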
Buffer.__init__() sets UCX_TLS='^cuda_ipc' when nvlink_backend!='nixl'.
This overwrites RDMA transports needed for multi-node communication.
UCX initialization happens in update_memory_buffers() -> init() -> _nixl_agent_init().
We must restore RDMA transports AFTER Buffer() but BEFORE update_memory_buffers().
UCX_TLS=^cuda_ipc means 'use all transports except cuda_ipc'.
UCX will auto-detect RDMA transports (rc_mlx5, dc_mlx5, tcp).
Don't override what buffer.py sets.
Log UCX_NET_DEVICES and NIXL_ETCD_ENDPOINTS right before Buffer creation
to verify environment is set correctly for multi-node.
This commit migrates elastic.py to TCPStore-based metadata exchange
instead of ETCD.

Note that this TCPStore instance can be re-used for rank management
(instead of the custom TCP based implementation we currently have)
but distributed rank management performs worse at scale (tested
with 64 concurrent get_rank() calls), so the rank server will stay
for now.

Signed-off-by: Itay Alroy <[email protected]>
- Pass tcp_store_group to Buffer() constructor (like elastic.py)
- Remove set_tcp_store_group() calls (no longer needed)
- Cherry-picked 4 commits from PR 1155:
  * Skip invalidate MD with TCPStore
  * Increase wait time for tcp store md exchange
  * Migrate elastic.py to TCPStore
  * Use tcp_store's multi_get to fetch all MDs at once
Copy elastic.py pattern: append IPoIB (InfiniBand) TCP interfaces
to UCX_NET_DEVICES for cross-node communication fallback.

Without TCP NICs, UCX can't establish active messages transport
between nodes, causing 'no active messages transport' errors.
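
A sketch of the fallback described above, appending IPoIB interfaces so UCX's tcp transport has a usable device; the device and interface names are placeholders:

  import os

  rdma_devices = os.environ.get("UCX_NET_DEVICES", "mlx5_0:1")  # per-GPU RDMA device
  ipoib_ifaces = ["ib0"]  # IPoIB interfaces usable by UCX's tcp transport (placeholder)

  os.environ["UCX_NET_DEVICES"] = ",".join([rdma_devices, *ipoib_ifaces])
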
- Each node now checks for its own expected global rank range
  (Master: 0-7, Worker: 8-15, etc.)
- Previous bug: all nodes checked for ranks 0-num_processes, causing the
  worker to think the master's ranks had died
- Add debug logging to show discovered GPU-NIC topology per node
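
The fixed liveness check, sketched: each node watches only its own slice of global ranks (names and values are illustrative):

  node_rank, num_processes = 1, 8   # e.g. the worker node in a 2x8 setup
  expected_ranks = range(node_rank * num_processes,
                         (node_rank + 1) * num_processes)   # worker: 8-15
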
Make rank assignment consistent with elastic.py:
- local_rank: GPU index on this node (0-7) from rank server
  → Used for GPU/NIC selection (CUDA_VISIBLE_DEVICES, UCX_NET_DEVICES)
- global_rank: Unique rank across all nodes (0-15 for 2 nodes) from rank server
  → Used for Buffer(rank=...)
- torch_rank: Spawn index (internal to torch.multiprocessing, not used)

Previous bug: Used torch_rank for GPU/NIC selection instead of local_rank,
causing GPU 0 to get GPU 4's NIC, etc.

This matches the simple, clear pattern in elastic.py.
Show both global rank and local_rank in debug output to verify
rank server assignment is working correctly.
eranrs added 20 commits January 5, 2026 13:44
…rocess index

Previous approach used connection-order rank assignment from rank server,
which caused unpredictable rank distribution across nodes. Now:
- global_rank = node_rank * num_processes + local_rank
- Node 0 gets ranks 0-7, Node 1 gets ranks 8-15, etc.
- Removes dependency on rank server for rank assignment (still used for barriers)
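
Putting the rank notions together, a sketch of the deterministic assignment described above (the Buffer constructor call is shown only as a comment since its full argument list is not spelled out here):

  import os

  node_rank = int(os.environ.get("RANK", "0"))   # node index, 0 = master
  num_processes = 8                              # GPUs per node (--num-processes)
  local_rank = 3                                 # spawn/GPU index on this node

  global_rank = node_rank * num_processes + local_rank   # node 1, GPU 3 -> rank 11

  os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)   # GPU/NIC selection uses local_rank
  # buffer = Buffer(rank=global_rank, ...)                # unique identity uses global_rank
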
Workers now log:
- ✓ TCPStore ready at <master>:<port>
- ✓ Master is alive! Connected to rank server at <master>:<port>
All log messages from run_multiprocess_test now start with [Node X]
where X is the node's rank (0=master, 1-N=workers)
…ages

All server readiness polling messages now include the node prefix for
consistent multi-node debugging output
- Pass node_rank from mp_runner to test function via kwargs
- Configure logger formatter in test function with [Node X] prefix
- All output including results tables now prefixed for multi-node debugging
Set logger formatter early in main() so all output including
the test configuration header and results tables have the prefix
- Use logger formatter instead of manual prefix in mp_runner.py
- Remove log_prefix parameter from wait_for_tcp_port and wait_for_server
- All logs now get prefix from formatter automatically
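
A minimal sketch of the formatter-based prefixing (the exact format string is an assumption):

  import logging

  def configure_node_logging(node_rank: int) -> None:
      handler = logging.StreamHandler()
      handler.setFormatter(logging.Formatter(f"[Node {node_rank}] %(message)s"))
      root = logging.getLogger()
      root.handlers[:] = [handler]   # replace any manual-prefix handlers
      root.setLevel(logging.INFO)
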
- Add --world-size, --rank, --master-addr parameters
- Add --use-tcp-store for TCPStore metadata exchange
- Add [Node X] prefix to all log messages
- Support TCPStore in _run_data_plane_test function
- Consistent with test_control_plane.py interface
The \n at the start of the log message caused the prefix to appear
after the newline. Use a separate logger.info('') call for the blank line instead.
Was using 'from store_group import store_group' which failed.
Changed to 'import store_group' to match test_control_plane.py
- Remove double torch.cuda.synchronize() around connect_ranks
- Clean all etcd keys instead of just nixl/* prefix
- Simplify etcd_server logic to always use master_addr
- Remove manual wait_for_tcp_port, use TCPStore built-in timeout
- Invert NIC discovery: default=skip, add --discover-nics flag
- Invert metadata exchange: default=TCPStore, add --use-etcd flag
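
The inverted defaults as argparse flags, sketched (help text and destinations assumed):

  import argparse

  parser = argparse.ArgumentParser()
  parser.add_argument("--discover-nics", action="store_true",
                      help="run NIC discovery (skipped by default)")
  parser.add_argument("--use-etcd", action="store_true",
                      help="use etcd for metadata exchange (TCPStore is the default)")
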
- Add wait_for_tcp_port for both master and worker nodes
- Prevents race condition where workers try to connect before server is ready
- Fixes timeout errors in single-node TCPStore tests
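
A plausible shape for the wait_for_tcp_port helper mentioned above; the real helper in mp_runner.py may differ:

  import socket
  import time

  def wait_for_tcp_port(host: str, port: int, timeout: float = 60.0) -> bool:
      """Poll until a TCP server accepts connections, or give up after timeout."""
      deadline = time.monotonic() + timeout
      while time.monotonic() < deadline:
          try:
              with socket.create_connection((host, port), timeout=1.0):
                  return True
          except OSError:
              time.sleep(0.5)
      return False
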
- Store returned object in variable to prevent Python GC
- Without this, TCPStore server would immediately shut down
- Fixes 'TCP port not ready' timeout error
- Add noqa comment for intentionally unused _store variable
- Variable must be kept alive to prevent garbage collection
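
The keep-alive fix, sketched with torch's TCPStore (host and port values are illustrative):

  from torch.distributed import TCPStore

  # Keep a reference; if the returned store is not bound to a variable,
  # Python may garbage-collect it and the server shuts down immediately.
  _store = TCPStore("127.0.0.1", 30500, is_master=True, wait_for_workers=False)  # noqa: F841
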
Focus PR scope on test/python/nixl_ep_perf only
- Add spacing around arithmetic operator (world_size - 1)
- Fix import order to be alphabetical (nixl_ep after numpy/torch)
- Update copyright year to 2025-2026
- Update copyright headers for files modified in 2026
- Fix import order: nixl_ep before numpy/torch (CI requirement)
- Add test_control_plane.py for control plane performance testing
- Use TCPStore by default instead of etcd (--use-etcd to opt-in)
- Skip NIC discovery by default (--discover-nics to opt-in)
- Aligned with test_data_plane.py flag conventions
- Copyright 2025-2026
eranrs requested a review from a team as a code owner on January 8, 2026 07:49

copy-pr-bot bot commented Jan 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


github-actions bot commented Jan 8, 2026

👋 Hi eranrs! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

eranrs force-pushed the adding_tests_control_plane branch from a1b1b0b to 3e59d8a on January 14, 2026 08:06
eranrs force-pushed the adding_tests_control_plane branch from 3e59d8a to d1564ba on January 14, 2026 08:08