@@ -528,6 +528,7 @@ NIXL Benchmark uses an ETCD key-value store for coordination between benchmark w
 
 1. Ensure ETCD server is running (e.g., `docker run -p 2379:2379 quay.io/coreos/etcd`)
 2. Launch multiple nixlbench instances pointing to the same ETCD server
+3. Launch all instances within the default timeout of 60s.
 
 **For single-instance storage benchmarks:**
 ```bash
@@ -538,21 +539,15 @@ NIXL Benchmark uses an ETCD key-value store for coordination between benchmark w
 ./nixlbench --etcd_endpoints http://etcd-server:2379 --backend GDS --filepath /mnt/storage/testfile
 ```
 
-Note: etcd can be installed directly on host as well:
-```bash
-apt install etcd-server
-```
-
-Example:
+**For multi-instance storage benchmarks where ETCD is required:**
 ```bash
 # On host 1
 ./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --initiator_seg_type VRAM --target_seg_type VRAM
 
 # On host 2
 ./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --initiator_seg_type VRAM --target_seg_type VRAM
 ```
-
-The workers automatically coordinate ranks through ETCD as they connect.
+The workers automatically coordinate ranks through ETCD as they connect. Note: start the second nixlbench instance within 60s; otherwise the first instance stops with an error at the barrier.
 
 ### Backend-Specific Examples
 
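Because the first instance waits only 60s at the barrier, it may help to confirm that ETCD is reachable before starting each instance. The following is a minimal sketch, not part of this change: the function name `wait_for_etcd` and its default timeout are hypothetical, and it assumes `etcdctl` is installed on the host.

```bash
#!/usr/bin/env bash
# Sketch: block until the etcd endpoint reports healthy, or give up.
# Assumptions: etcdctl is on PATH; wait_for_etcd and the 30s default
# are illustrative names/values, not part of nixlbench.

wait_for_etcd() {
    local endpoint="$1" timeout_s="${2:-30}" waited=0

    # "etcdctl endpoint health" exits non-zero while the endpoint is unhealthy.
    until ETCDCTL_API=3 etcdctl --endpoints "$endpoint" endpoint health >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout_s" ]; then
            return 1    # etcd did not become healthy in time
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

Each host could then run `wait_for_etcd http://etcd-server:2379` before its `./nixlbench` command, so the 60s window is not spent waiting on an unreachable ETCD server.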
@@ -562,9 +557,11 @@ The workers automatically coordinate ranks through ETCD as they connect.
 ```bash
 # Basic UCX benchmark
 ./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX
+sleep 2 && ./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX
 
 # UCX with specific devices
-./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --device_list mlx5_0,mlx5_1
+./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --device_list mlx5_0,mlx5_1  # run on host 1
+sleep 2 && ./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --device_list mlx5_0,mlx5_1  # run on host 2
 ```
 
 **GPUNETIO Backend:**
@@ -706,20 +703,6 @@ Transfer times are higher than local storage, so consider reducing iterations:
 - Test read operations: `--op_type READ`
 - Validate data consistency: `--check_consistency`
 
-### Multi-Node Coordination
-
-Launch multiple nixlbench instances pointing to the same ETCD server:
-
-```bash
-# On host 1
-./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --initiator_seg_type VRAM --target_seg_type VRAM
-
-# On host 2
-./nixlbench --etcd_endpoints http://etcd-server:2379 --backend UCX --initiator_seg_type VRAM --target_seg_type VRAM
-```
-
-The workers automatically coordinate ranks through ETCD as they connect.
-
 ## Troubleshooting
 
 ### Common Build Issues
@@ -814,6 +797,12 @@ export UCX_LOG_LEVEL=DEBUG # Verbose UCX logging
 export UCX_PROTO_INFO=y # See transport used by UCX
 ```
 
+#### ETCD Cleanup
+```bash
+# If a nixlbench instance failed, clean up its leftover keys in etcd before starting nixlbench again
+ETCDCTL_API=3 etcdctl del "xferbench" --prefix=true
+```
+
 ### Performance Tuning
 
 #### CPU Affinity
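The ETCD cleanup step added in the hunk above deletes keys without showing them first. A hedged sketch of a slightly more cautious variant follows. It lists the keys under the prefix, then deletes them. The function name `cleanup_xferbench_keys` and the default endpoint are hypothetical. The `xferbench` prefix is taken from the command in the diff, and `etcdctl` is assumed to be installed.

```bash
#!/usr/bin/env bash
# Sketch: list leftover nixlbench keys in etcd, then delete them.
# Assumptions: etcdctl is on PATH; the "xferbench" prefix comes from the
# cleanup command in this change; the default endpoint is a placeholder.

cleanup_xferbench_keys() {
    local endpoint="${1:-http://localhost:2379}"
    local prefix="${2:-xferbench}"

    # Print the keys that are about to be removed, so the deletion is not blind.
    ETCDCTL_API=3 etcdctl --endpoints "$endpoint" get "$prefix" --prefix --keys-only

    # Delete every key under the prefix.
    ETCDCTL_API=3 etcdctl --endpoints "$endpoint" del "$prefix" --prefix
}
```

Calling `cleanup_xferbench_keys http://etcd-server:2379` after a failed run prints the stale keys and then the number of keys deleted, which makes it easier to confirm that the next run starts from a clean state.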
@@ -842,4 +831,4 @@ sudo sysctl -p
 
 ---
 
-*This guide covers NIXLBench build and usage procedures as of 2025. For the latest updates, please refer to the official repository.*
+*This guide covers NIXLBench build and usage procedures as of 2025. For the latest updates, please refer to the official repository.*