Add data plane tests for nixl_ep #1109
Open: eranrs wants to merge 55 commits into ai-dynamo:main from eranrs:adding_tests
+1,550 −0
55 commits
e6a6e68 test(nixl_ep): add data plane performance test (eranrs)
c72c503 test(nixl_ep): add control plane performance test (eranrs)
038f604 feat(test): add --use-tcp-store flag to control plane test (eranrs)
1561aa7 fix(test): skip etcd check when using TCPStore (eranrs)
6b12e37 fix(test): pass use_tcp_store to test function via kwargs (eranrs)
13bb42b refactor(test): adopt PR 1155 TCPStore pattern (eranrs)
7bdfd4f Fix TCPStore metadata exchange in mp_runner (eranrs)
84f31ac feat(test): add multi-node support using WORLD_SIZE and RANK env vars (eranrs)
4c96a1b fix(test): ensure only master node handles etcd/TCPStore/rank-server (eranrs)
8ec076f docs(test): add multi-node usage documentation (eranrs)
ff3e846 fix(test): use correct RankClient API for rank registration (eranrs)
f7422bb fix(test): inherit UCX_TLS env var for RDMA multi-node support (eranrs)
30f8017 fix(test): pre-set UCX_TLS with RDMA transports and cuda_ipc exclusion (eranrs)
063f871 debug: add UCX_TLS logging to diagnose metadata issue (eranrs)
67aa3c7 fix(test): always use rank server for rank assignment (eranrs)
b7183ae fix(test): use master_addr for etcd in multi-node setup (eranrs)
078cd18 Add debug logging to track UCX_TLS and Buffer parameters (eranrs)
78adf99 Add debug logging to trace NIXL_ETCD_ENDPOINTS setting (eranrs)
e7b5f66 CRITICAL FIX: Restore UCX_TLS after Buffer.__init__() for multi-node (eranrs)
dac7f9f Revert UCX_TLS explicit setting - let buffer.py handle it (eranrs)
5ae9dfa Add debug logging following elastic.py pattern (eranrs)
8c44e27 nixl_ep: Skip invalidate MD with TCPStore (itayalroy)
dc5104a nixl_ep: Increase wait time for tcp store md exchange (itayalroy)
6385ded nixl_ep: Migrate elastic.py to TCPStore (itayalroy)
f1ecff6 Use tcp_store's multi_get to fetch all MDs at once (itayalroy)
5b1ced3 Align control plane tests with elastic.py pattern (PR 1155) (eranrs)
917b0e6 Add TCP fallback NICs to UCX_NET_DEVICES for cross-node (eranrs)
424b233 Fix multi-node result collection and add topology debug logging (eranrs)
abb067e Simplify rank naming to match elastic.py pattern (eranrs)
022c0a5 Add local_rank to debug logging (eranrs)
872ef48 Fix NameError: pass local_rank to _run_single_op (eranrs)
9514a96 Show per-node results instead of master-only (eranrs)
348fa6c Add 60s retry/wait for workers to connect to master services (eranrs)
42fbacc Revert "Add 60s retry/wait for workers to connect to master services" (eranrs)
b312444 Add polling-based wait for worker nodes to connect to master servers (eranrs)
5a77bd0 Fix: Workers should not clear barriers - only master should (eranrs)
b9905f1 Fix: Use deterministic rank assignment based on node_rank and local p… (eranrs)
f014727 Add explicit logging when workers confirm master is alive (eranrs)
b28c647 Add [Node X] prefix to all log messages for easier multi-node debugging (eranrs)
67bd73b Add [Node X] prefix to wait_for_tcp_port and wait_for_server log mess… (eranrs)
9e8cffd Add [Node X] prefix to all test results and headers (eranrs)
f15b3cc Add [Node X] prefix to main() header and results output (eranrs)
ebc5f41 Fix double [Node X] prefix in log messages (eranrs)
43a1420 Add multi-node support to test_data_plane.py (eranrs)
a18161d Fix missing [Node X] prefix on 'Running:' log message (eranrs)
7444227 Fix store_group import in test_data_plane.py (eranrs)
901a542 Remove test_control_plane.py - focus PR on data plane tests only (eranrs)
aa78c1d Fix pre-commit issues: isort, flake8 (unused imports/variables) (eranrs)
f459141 test(nixl_ep): address PR review comments (eranrs)
f119a64 fix(nixl_ep): ensure TCPStore server is ready before spawning workers (eranrs)
58a7f77 fix(nixl_ep): keep TCPStore object alive to prevent garbage collection (eranrs)
4c46c51 style(nixl_ep): fix flake8 unused variable warning (eranrs)
f71f691 chore: revert changes to examples/ and pyproject.toml (eranrs)
8016df4 style(nixl_ep): fix flake8 E226 and import order (eranrs)
35711b1 chore(nixl_ep): update copyright to 2025-2026 and fix import order (eranrs)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,164 @@ | ||
| # NIXL EP Performance Tests | ||
|
|
||
| Performance tests for NIXL EP Buffer: | ||
| - **Data Plane**: dispatch/combine throughput and latency | ||
| - **Control Plane**: init/connect/disconnect/destroy latency | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - etcd running locally (`etcd &` or `source /workspace/nixl/examples/device/ep/scripts/reset_etcd.sh`); needed only with `--use-etcd`, since TCPStore is the default | ||
| - CUDA device with RDMA support | ||
|
|
||
| ## Environment Setup | ||
|
|
||
| ```bash | ||
| # For RDMA performance (recommended) | ||
| export UCX_TLS=rc_mlx5,dc_mlx5,tcp | ||
| export UCX_IB_AR_ENABLE=no # Disable Adaptive Routing for consistent performance | ||
| ``` | ||
|
|
||
| ## Usage | ||
|
|
||
| ```bash | ||
| cd test/python/nixl_ep_perf | ||
|
|
||
| # IPC/NVLink backend (default) | ||
| python3 test_data_plane.py --num-processes=8 --mode=e2e | ||
|
|
||
| # RDMA-only (disable NVLink) | ||
| UCX_TLS=rc_mlx5,dc_mlx5,tcp UCX_IB_AR_ENABLE=no \ | ||
| python3 test_data_plane.py --num-processes=8 --mode=e2e --nvlink-backend none | ||
|
|
||
| # Dispatch only (measures dispatch throughput) | ||
| python3 test_data_plane.py --num-processes=8 --mode=dispatch | ||
|
|
||
| # Combine only (one dispatch, many combines) | ||
| python3 test_data_plane.py --num-processes=8 --mode=combine | ||
| ``` | ||
|
|
||
| ## Options | ||
|
|
||
| | Flag | Default | Description | | ||
| |------|---------|-------------| | ||
| | `--num-processes` | 8 | Number of ranks/GPUs | | ||
| | `--mode` | e2e | Test mode: dispatch, combine, e2e | | ||
| | `--tokens` | 512 | Number of tokens | | ||
| | `--hidden` | 4096 | Hidden dimension | | ||
| | `--experts-per-rank` | 8 | Experts per rank | | ||
| | `--topk` | 2 | TopK value | | ||
| | `--nvlink-backend` | ipc | Backend: ipc, nixl, none (RDMA only) | | ||
| | `--warmup` | 10 | Warmup iterations | | ||
| | `--iters` | 100 | Measurement iterations | | ||
| | `--discover-nics` | false | Enable GPU-NIC topology discovery (default: disabled, UCX auto-selects) | | ||
| | `--use-etcd` | false | Use etcd for metadata exchange (default: TCPStore) | | ||
|
|
||
| ## Example Output | ||
|
|
||
| ``` | ||
| ====================================================================== | ||
| NIXL EP Data Plane Performance Test | ||
| ====================================================================== | ||
| Mode: e2e | ||
| Ranks: 8, Tokens: 128, Hidden: 7168 | ||
| Experts: 36/rank (288 total), TopK: 8 | ||
| Backend: none (RDMA forced) | ||
| Warmup: 10, Measure: 100 iterations | ||
| ====================================================================== | ||
| ====================================================================== | ||
| Data Plane (e2e): 8/8 ranks passed | ||
| ====================================================================== | ||
| Bandwidth (GB/s): avg=42.88, min=42.86, max=42.89 | ||
| Latency (μs): avg=519.3, min=519.1, max=519.5 | ||
| ``` | ||
|
|
||
| ## Expected Performance (DFW cluster, RDMA, AR=no) | ||
|
|
||
| | Mode | Bandwidth | Latency | | ||
| |------|-----------|---------| | ||
| | E2E | ~42.8 GB/s | ~520 μs | | ||
| | Dispatch | ~42.1 GB/s | ~180 μs | | ||
| | Combine | ~43.3 GB/s | ~340 μs | | ||
|
|
||
| *Config: 128 tokens, 7168 hidden, topk=8, 288 experts (36/rank), 8 GPUs* | ||
|
|
||
| ## Control Plane Tests | ||
|
|
||
| Measures latency of control plane operations (init, connect, disconnect, destroy). | ||
|
|
||
| ### Single-Node (Default) | ||
|
|
||
| ```bash | ||
| # Full cycle (init → connect → disconnect → reconnect → destroy) | ||
| python3 test_control_plane.py --num-processes=8 | ||
|
|
||
| # Specific expert counts | ||
| python3 test_control_plane.py --num-processes=8 --experts-per-rank=8,32 | ||
|
|
||
| # Single operation | ||
| python3 test_control_plane.py --num-processes=8 --test=connect | ||
|
|
||
| # Use etcd instead of TCPStore (if needed) | ||
| python3 test_control_plane.py --num-processes=8 --use-etcd | ||
| ``` | ||
|
|
||
| ### Multi-Node Setup | ||
|
|
||
| Use environment variables `WORLD_SIZE`, `RANK`, and `MASTER_ADDR` for multi-node testing: | ||
|
|
||
| **Master Node (RANK=0):** | ||
| ```bash | ||
| WORLD_SIZE=2 RANK=0 MASTER_ADDR=node0.example.com \ | ||
| python3 test_control_plane.py --num-processes=8 | ||
| ``` | ||
|
|
||
| **Worker Node (RANK=1):** | ||
| ```bash | ||
| WORLD_SIZE=2 RANK=1 MASTER_ADDR=node0.example.com \ | ||
| python3 test_control_plane.py --num-processes=8 | ||
| ``` | ||
|
|
||
| **Or use CLI flags:** | ||
| ```bash | ||
| # Master | ||
| python3 test_control_plane.py --num-processes=8 --world-size=2 --rank=0 --master-addr=node0 | ||
|
|
||
| # Worker | ||
| python3 test_control_plane.py --num-processes=8 --world-size=2 --rank=1 --master-addr=node0 | ||
| ``` | ||
|
|
||
| **Key points:** | ||
| - `WORLD_SIZE` = number of nodes (not total ranks) | ||
| - `--num-processes` = GPUs per node | ||
| - Total ranks = WORLD_SIZE × num-processes | ||
| - RANK=0 is master (runs TCPStore/rank server) | ||
| - RANK>0 are workers (connect to master) | ||
| - TCPStore is used by default (no etcd dependency); use `--use-etcd` to switch to etcd | ||
|
|
||
| ### Example Output | ||
|
|
||
| ``` | ||
| ====================================================================== | ||
| Control Plane: 8 experts/rank x 8 ranks = 64 total | ||
| ====================================================================== | ||
| Operation Avg (ms) Min (ms) Max (ms) | ||
| ---------------------------------------------------------------------- | ||
| init 150.23 148.15 152.31 | ||
| connect 245.67 242.33 248.91 | ||
| disconnect 12.45 11.23 13.67 | ||
| reconnect 198.34 195.12 201.56 | ||
| destroy 85.12 83.45 86.79 | ||
| ---------------------------------------------------------------------- | ||
| TOTAL 691.81 | ||
| ====================================================================== | ||
| ``` | ||
|
|
||
| ## Files | ||
|
|
||
| | File | Description | | ||
| |------|-------------| | ||
| | `test_data_plane.py` | Data plane test (dispatch/combine/e2e) | | ||
| | `test_control_plane.py` | Control plane test (init/connect/disconnect/destroy) | | ||
| | `mp_runner.py` | Multi-process test runner | | ||
| | `rank_server.py` | Coordination server for distributed tests | | ||
|
|
||
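
The multi-node key points in the README above state that `WORLD_SIZE` counts nodes, `--num-processes` counts GPUs per node, and total ranks = WORLD_SIZE × num-processes. Commit b9905f1 mentions deterministic rank assignment based on `node_rank` and a local index. The sketch below shows one plausible node-major mapping. It is an assumption for illustration, not the PR's actual code, and the helper names `total_ranks` and `global_rank` are hypothetical.

```python
def total_ranks(world_size: int, procs_per_node: int) -> int:
    """Total ranks across the job: number of nodes times GPUs per node."""
    return world_size * procs_per_node


def global_rank(node_rank: int, local_rank: int, procs_per_node: int) -> int:
    """Map (node, local process) to a global rank.

    Node-major ordering is an assumption made for this sketch; the PR's
    real mapping is not shown on this page.
    """
    if not 0 <= local_rank < procs_per_node:
        raise ValueError("local_rank must be in [0, procs_per_node)")
    if node_rank < 0:
        raise ValueError("node_rank must be non-negative")
    return node_rank * procs_per_node + local_rank


if __name__ == "__main__":
    world_size, procs = 2, 8  # two nodes, eight GPUs each, as in the README example
    print(total_ranks(world_size, procs))
    print([global_rank(1, lr, procs) for lr in range(procs)])
```

Under this mapping, every rank can compute its global index locally from `RANK` and its local process index, so the result is the same on every run.
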
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| """NIXL EP data plane performance tests.""" |
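
The README's multi-node section says worker nodes (RANK>0) connect to services that the master (RANK=0) runs, and commit b312444 adds a polling-based wait for that connection. A minimal sketch of such a wait follows, using only the Python standard library. The function name `wait_for_tcp_port` appears in commit 67bd73b, but the signature, defaults, and behavior here are assumptions rather than the PR's implementation.

```python
import socket
import time


def wait_for_tcp_port(host: str, port: int,
                      timeout_s: float = 60.0,
                      interval_s: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires.

    Returns True once a connection is accepted, False if the deadline passes.
    The signature and defaults are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A worker could call something like `wait_for_tcp_port(master_addr, store_port)` before connecting to the store, where `master_addr` and `store_port` are placeholders for whatever the test actually uses. Unlike a fixed sleep, the poll returns as soon as the master is reachable.
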