Open
55 commits
e6a6e68
test(nixl_ep): add data plane performance test
eranrs Dec 30, 2025
c72c503
test(nixl_ep): add control plane performance test
eranrs Dec 30, 2025
038f604
feat(test): add --use-tcp-store flag to control plane test
eranrs Dec 31, 2025
1561aa7
fix(test): skip etcd check when using TCPStore
eranrs Dec 31, 2025
6b12e37
fix(test): pass use_tcp_store to test function via kwargs
eranrs Dec 31, 2025
13bb42b
refactor(test): adopt PR 1155 TCPStore pattern
eranrs Dec 31, 2025
7bdfd4f
Fix TCPStore metadata exchange in mp_runner
eranrs Jan 1, 2026
84f31ac
feat(test): add multi-node support using WORLD_SIZE and RANK env vars
eranrs Jan 4, 2026
4c96a1b
fix(test): ensure only master node handles etcd/TCPStore/rank-server
eranrs Jan 4, 2026
8ec076f
docs(test): add multi-node usage documentation
eranrs Jan 4, 2026
ff3e846
fix(test): use correct RankClient API for rank registration
eranrs Jan 4, 2026
f7422bb
fix(test): inherit UCX_TLS env var for RDMA multi-node support
eranrs Jan 4, 2026
30f8017
fix(test): pre-set UCX_TLS with RDMA transports and cuda_ipc exclusion
eranrs Jan 4, 2026
063f871
debug: add UCX_TLS logging to diagnose metadata issue
eranrs Jan 4, 2026
67aa3c7
fix(test): always use rank server for rank assignment
eranrs Jan 4, 2026
b7183ae
fix(test): use master_addr for etcd in multi-node setup
eranrs Jan 4, 2026
078cd18
Add debug logging to track UCX_TLS and Buffer parameters
eranrs Jan 4, 2026
78adf99
Add debug logging to trace NIXL_ETCD_ENDPOINTS setting
eranrs Jan 4, 2026
e7b5f66
CRITICAL FIX: Restore UCX_TLS after Buffer.__init__() for multi-node
eranrs Jan 4, 2026
dac7f9f
Revert UCX_TLS explicit setting - let buffer.py handle it
eranrs Jan 4, 2026
5ae9dfa
Add debug logging following elastic.py pattern
eranrs Jan 4, 2026
8c44e27
nixl_ep: Skip invalidate MD with TCPStore
itayalroy Dec 23, 2025
dc5104a
nixl_ep: Increase wait time for tcp store md exchange
itayalroy Dec 24, 2025
6385ded
nixl_ep: Migrate elastic.py to TCPStore
itayalroy Dec 25, 2025
f1ecff6
Use tcp_store's multi_get to fetch all MDs at once
itayalroy Dec 25, 2025
5b1ced3
Align control plane tests with elastic.py pattern (PR 1155)
eranrs Jan 5, 2026
917b0e6
Add TCP fallback NICs to UCX_NET_DEVICES for cross-node
eranrs Jan 5, 2026
424b233
Fix multi-node result collection and add topology debug logging
eranrs Jan 5, 2026
abb067e
Simplify rank naming to match elastic.py pattern
eranrs Jan 5, 2026
022c0a5
Add local_rank to debug logging
eranrs Jan 5, 2026
872ef48
Fix NameError: pass local_rank to _run_single_op
eranrs Jan 5, 2026
9514a96
Show per-node results instead of master-only
eranrs Jan 5, 2026
348fa6c
Add 60s retry/wait for workers to connect to master services
eranrs Jan 5, 2026
42fbacc
Revert "Add 60s retry/wait for workers to connect to master services"
eranrs Jan 5, 2026
b312444
Add polling-based wait for worker nodes to connect to master servers
eranrs Jan 5, 2026
5a77bd0
Fix: Workers should not clear barriers - only master should
eranrs Jan 5, 2026
b9905f1
Fix: Use deterministic rank assignment based on node_rank and local p…
eranrs Jan 5, 2026
f014727
Add explicit logging when workers confirm master is alive
eranrs Jan 5, 2026
b28c647
Add [Node X] prefix to all log messages for easier multi-node debugging
eranrs Jan 5, 2026
67bd73b
Add [Node X] prefix to wait_for_tcp_port and wait_for_server log mess…
eranrs Jan 5, 2026
9e8cffd
Add [Node X] prefix to all test results and headers
eranrs Jan 5, 2026
f15b3cc
Add [Node X] prefix to main() header and results output
eranrs Jan 5, 2026
ebc5f41
Fix double [Node X] prefix in log messages
eranrs Jan 5, 2026
43a1420
Add multi-node support to test_data_plane.py
eranrs Jan 5, 2026
a18161d
Fix missing [Node X] prefix on 'Running:' log message
eranrs Jan 5, 2026
7444227
Fix store_group import in test_data_plane.py
eranrs Jan 5, 2026
901a542
Remove test_control_plane.py - focus PR on data plane tests only
eranrs Jan 5, 2026
aa78c1d
Fix pre-commit issues: isort, flake8 (unused imports/variables)
eranrs Jan 5, 2026
f459141
test(nixl_ep): address PR review comments
eranrs Jan 7, 2026
f119a64
fix(nixl_ep): ensure TCPStore server is ready before spawning workers
eranrs Jan 7, 2026
58a7f77
fix(nixl_ep): keep TCPStore object alive to prevent garbage collection
eranrs Jan 7, 2026
4c46c51
style(nixl_ep): fix flake8 unused variable warning
eranrs Jan 7, 2026
f71f691
chore: revert changes to examples/ and pyproject.toml
eranrs Jan 7, 2026
8016df4
style(nixl_ep): fix flake8 E226 and import order
eranrs Jan 7, 2026
35711b1
chore(nixl_ep): update copyright to 2025-2026 and fix import order
eranrs Jan 7, 2026
164 changes: 164 additions & 0 deletions test/python/nixl_ep_perf/README.md
@@ -0,0 +1,164 @@
# NIXL EP Performance Tests

Performance tests for NIXL EP Buffer:
- **Data Plane**: dispatch/combine throughput and latency
- **Control Plane**: init/connect/disconnect/destroy latency

## Prerequisites

- etcd running locally (`etcd &` or `source /workspace/nixl/examples/device/ep/scripts/reset_etcd.sh`), required only when running with `--use-etcd`; the default TCPStore path has no etcd dependency
- CUDA device with RDMA support

## Environment Setup

```bash
# For RDMA performance (recommended)
export UCX_TLS=rc_mlx5,dc_mlx5,tcp
export UCX_IB_AR_ENABLE=no # Disable Adaptive Routing for consistent performance
```

## Usage

```bash
cd test/python/nixl_ep_perf

# IPC/NVLink backend (default)
python3 test_data_plane.py --num-processes=8 --mode=e2e

# RDMA-only (disable NVLink)
UCX_TLS=rc_mlx5,dc_mlx5,tcp UCX_IB_AR_ENABLE=no \
python3 test_data_plane.py --num-processes=8 --mode=e2e --nvlink-backend none

# Dispatch only (measures dispatch throughput)
python3 test_data_plane.py --num-processes=8 --mode=dispatch

# Combine only (one dispatch, many combines)
python3 test_data_plane.py --num-processes=8 --mode=combine
```

## Options

| Flag | Default | Description |
|------|---------|-------------|
| `--num-processes` | 8 | Number of ranks/GPUs |
| `--mode` | e2e | Test mode: dispatch, combine, e2e |
| `--tokens` | 512 | Number of tokens |
| `--hidden` | 4096 | Hidden dimension |
| `--experts-per-rank` | 8 | Experts per rank |
| `--topk` | 2 | TopK value |
| `--nvlink-backend` | ipc | Backend: ipc, nixl, none (RDMA only) |
| `--warmup` | 10 | Warmup iterations |
| `--iters` | 100 | Measurement iterations |
| `--discover-nics` | false | Enable GPU-NIC topology discovery; when disabled, UCX auto-selects NICs |
| `--use-etcd` | false | Use etcd for metadata exchange instead of the default TCPStore |
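
The options table can be read as a command-line parser. The sketch below is a hypothetical reconstruction built only from the flags and defaults listed above; `build_parser` is an illustrative name, and the real `test_data_plane.py` may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table (not the test's actual code)."""
    p = argparse.ArgumentParser(description="NIXL EP data plane perf (sketch)")
    p.add_argument("--num-processes", type=int, default=8, help="Number of ranks/GPUs")
    p.add_argument("--mode", choices=["dispatch", "combine", "e2e"], default="e2e")
    p.add_argument("--tokens", type=int, default=512, help="Number of tokens")
    p.add_argument("--hidden", type=int, default=4096, help="Hidden dimension")
    p.add_argument("--experts-per-rank", type=int, default=8)
    p.add_argument("--topk", type=int, default=2)
    p.add_argument("--nvlink-backend", choices=["ipc", "nixl", "none"], default="ipc")
    p.add_argument("--warmup", type=int, default=10, help="Warmup iterations")
    p.add_argument("--iters", type=int, default=100, help="Measurement iterations")
    # Boolean switches: absent means False, matching the "false" defaults in the table.
    p.add_argument("--discover-nics", action="store_true")
    p.add_argument("--use-etcd", action="store_true")
    return p


# Parse the same flags as the "RDMA-only" usage example, in dispatch mode.
args = build_parser().parse_args(["--mode=dispatch", "--nvlink-backend", "none"])
print(args.mode, args.nvlink_backend, args.num_processes, args.use_etcd)
```

Using `choices` for `--mode` and `--nvlink-backend` makes the parser reject values outside the sets listed in the table.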

## Example Output

```
======================================================================
NIXL EP Data Plane Performance Test
======================================================================
Mode: e2e
Ranks: 8, Tokens: 128, Hidden: 7168
Experts: 36/rank (288 total), TopK: 8
Backend: none (RDMA forced)
Warmup: 10, Measure: 100 iterations
======================================================================
======================================================================
Data Plane (e2e): 8/8 ranks passed
======================================================================
Bandwidth (GB/s): avg=42.88, min=42.86, max=42.89
Latency (μs): avg=519.3, min=519.1, max=519.5
```

## Expected Performance (DFW cluster, RDMA, AR=no)

| Mode | Bandwidth | Latency |
|------|-----------|---------|
| E2E | ~42.8 GB/s | ~520 μs |
| Dispatch | ~42.1 GB/s | ~180 μs |
| Combine | ~43.3 GB/s | ~340 μs |

*Config: 128 tokens, 7168 hidden, topk=8, 288 experts (36/rank), 8 GPUs*
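
As a sanity check on the table, the figures can be cross-checked against each other. The sketch below assumes that reported bandwidth is simply bytes moved divided by latency and that 1 GB is 10^9 bytes; the test may define either differently, so treat the result as a rough consistency check rather than a description of how the test computes its numbers.

```python
# Figures copied from the table above: (bandwidth in GB/s, latency in microseconds).
EXPECTED = {
    "e2e": (42.8, 520.0),
    "dispatch": (42.1, 180.0),
    "combine": (43.3, 340.0),
}


def implied_megabytes(bandwidth_gb_s: float, latency_us: float) -> float:
    """Data moved per iteration in MB, assuming bandwidth = bytes / latency."""
    bytes_moved = bandwidth_gb_s * 1e9 * latency_us * 1e-6
    return bytes_moved / 1e6


payload_mb = {mode: implied_megabytes(bw, lat) for mode, (bw, lat) in EXPECTED.items()}
for mode, mb in payload_mb.items():
    print(f"{mode:<9} ~{mb:.1f} MB per iteration")

# Under this assumption, dispatch + combine should roughly add up to e2e.
gap_mb = abs(payload_mb["dispatch"] + payload_mb["combine"] - payload_mb["e2e"])
print(f"dispatch + combine vs e2e differ by {gap_mb:.2f} MB")
```

Under these assumptions the dispatch and combine latencies (180 μs + 340 μs) sum to the E2E latency, and the implied payloads agree to within rounding.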

## Control Plane Tests

Measures latency of control plane operations (init, connect, disconnect, destroy).

### Single-Node (Default)

```bash
# Full cycle (init → connect → disconnect → reconnect → destroy)
python3 test_control_plane.py --num-processes=8

# Specific expert counts
python3 test_control_plane.py --num-processes=8 --experts-per-rank=8,32

# Single operation
python3 test_control_plane.py --num-processes=8 --test=connect

# Use etcd instead of TCPStore (if needed)
python3 test_control_plane.py --num-processes=8 --use-etcd
```

### Multi-Node Setup

Use environment variables `WORLD_SIZE`, `RANK`, and `MASTER_ADDR` for multi-node testing:

**Master Node (RANK=0):**
```bash
WORLD_SIZE=2 RANK=0 MASTER_ADDR=node0.example.com \
python3 test_control_plane.py --num-processes=8
```

**Worker Node (RANK=1):**
```bash
WORLD_SIZE=2 RANK=1 MASTER_ADDR=node0.example.com \
python3 test_control_plane.py --num-processes=8
```

**Or use CLI flags:**
```bash
# Master
python3 test_control_plane.py --num-processes=8 --world-size=2 --rank=0 --master-addr=node0

# Worker
python3 test_control_plane.py --num-processes=8 --world-size=2 --rank=1 --master-addr=node0
```

**Key points:**
- `WORLD_SIZE` = number of nodes (not total ranks)
- `--num-processes` = GPUs per node
- Total ranks = WORLD_SIZE × num-processes
- RANK=0 is master (runs TCPStore/rank server)
- RANK>0 are workers (connect to master)
- TCPStore is used by default (no etcd dependency); use `--use-etcd` to switch to etcd
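
The relationship between nodes, per-node processes, and total ranks can be sketched as below. `NodeLayout`, `layout_from_env`, the single-node defaults, and the node-major `global_rank` formula are all illustrative assumptions; the test's actual rank assignment may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeLayout:
    world_size: int     # number of nodes, not total ranks
    node_rank: int      # RANK of this node; 0 is the master
    num_processes: int  # GPUs (processes) per node
    master_addr: str

    @property
    def total_ranks(self) -> int:
        return self.world_size * self.num_processes

    @property
    def is_master(self) -> bool:
        return self.node_rank == 0

    def global_rank(self, local_rank: int) -> int:
        # Assumed node-major numbering; the real scheme may differ.
        if not 0 <= local_rank < self.num_processes:
            raise ValueError(f"local_rank {local_rank} out of range")
        return self.node_rank * self.num_processes + local_rank


def layout_from_env(num_processes: int, env=os.environ) -> NodeLayout:
    """Read WORLD_SIZE, RANK and MASTER_ADDR; the defaults here are assumptions."""
    return NodeLayout(
        world_size=int(env.get("WORLD_SIZE", "1")),
        node_rank=int(env.get("RANK", "0")),
        num_processes=num_processes,
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
    )


# Equivalent to the worker-node command above: WORLD_SIZE=2 RANK=1, 8 GPUs per node.
worker = layout_from_env(
    8, {"WORLD_SIZE": "2", "RANK": "1", "MASTER_ADDR": "node0.example.com"}
)
print(worker.total_ranks, worker.is_master, [worker.global_rank(i) for i in range(8)])
```

With two nodes of eight GPUs each, this layout yields 16 total ranks, and the worker node owns global ranks 8 through 15 under the assumed numbering.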

### Example Output

```
======================================================================
Control Plane: 8 experts/rank x 8 ranks = 64 total
======================================================================
Operation           Avg (ms)    Min (ms)    Max (ms)
----------------------------------------------------------------------
init                  150.23      148.15      152.31
connect               245.67      242.33      248.91
disconnect             12.45       11.23       13.67
reconnect             198.34      195.12      201.56
destroy                85.12       83.45       86.79
----------------------------------------------------------------------
TOTAL                 691.81
======================================================================
```
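
A summary of this shape can be produced by reducing per-rank timings to average, minimum, and maximum per operation. The sketch below is a minimal illustration, not the test's reporting code: `summarize` and `render` are hypothetical helpers, the sample latencies are placeholders rather than measurements, and `TOTAL` is assumed to be the sum of the per-operation averages.

```python
from statistics import mean

# Placeholder per-rank latencies in ms (illustrative values, not measurements).
samples = {
    "init": [148.15, 152.31],
    "disconnect": [11.23, 13.67],
}


def summarize(per_rank_ms):
    """Return (operation, avg, min, max) rows from per-rank latencies."""
    return [(op, mean(v), min(v), max(v)) for op, v in per_rank_ms.items()]


def render(rows, width=70):
    """Format rows in a layout similar to the example output."""
    rule = "=" * width
    header = f"{'Operation':<16}{'Avg (ms)':>12}{'Min (ms)':>12}{'Max (ms)':>12}"
    lines = [rule, header, "-" * width]
    lines += [f"{op:<16}{avg:>12.2f}{lo:>12.2f}{hi:>12.2f}" for op, avg, lo, hi in rows]
    # Assumed definition: TOTAL is the sum of the per-operation averages.
    total = sum(avg for _, avg, _, _ in rows)
    lines += ["-" * width, f"{'TOTAL':<16}{total:>12.2f}", rule]
    return "\n".join(lines)


rows = summarize(samples)
print(render(rows))
```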

## Files

| File | Description |
|------|-------------|
| `test_data_plane.py` | Data plane test (dispatch/combine/e2e) |
| `test_control_plane.py` | Control plane test (init/connect/disconnect/destroy) |
| `mp_runner.py` | Multi-process test runner |
| `rank_server.py` | Coordination server for distributed tests |

3 changes: 3 additions & 0 deletions test/python/nixl_ep_perf/__init__.py
@@ -0,0 +1,3 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""NIXL EP data plane performance tests."""