
akkart-aws (Collaborator) commented Dec 21, 2025

What?

Fix incorrect EFA device selection when using CUDA_VISIBLE_DEVICES by switching from GPU ID-based to PCI bus ID-based topology mapping.

Why?

The previous implementation used GPU IDs (0, 1, 2...) to map GPUs to EFA devices. This broke when CUDA_VISIBLE_DEVICES was set because:

  • CUDA_VISIBLE_DEVICES=1 makes physical GPU 1 appear as device 0 to the application
  • Topology mapping still used physical GPU IDs, so the application-visible device ID no longer matched the topology key
  • Result: Wrong EFA devices selected for GPU memory, breaking GPU Direct RDMA

Example failure case:

CUDA_VISIBLE_DEVICES=1  # Physical GPU 1 → appears as device 0
App registers memory on "device 0" (actually physical GPU 1)
Old code: Looks up GPU 0 in topology → selects EFA devices for physical GPU 0 (should be physical GPU 1)
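
The renumbering can be modeled without a GPU. The sketch below is a simplified illustration, assuming CUDA_VISIBLE_DEVICES holds only comma-separated integer indices (UUID entries and the driver's handling of invalid entries are not modeled). `parseVisibleDevices` and `physicalIndexFor` are hypothetical helper names, not part of NIXL or CUDA.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parse a CUDA_VISIBLE_DEVICES-style value such as "1" or "2,0".
// Position i of the result holds the physical GPU index that the
// application sees as device i.
// Assumption: every entry is a plain integer index.
std::vector<int> parseVisibleDevices(const std::string &value) {
    std::vector<int> physical;
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) {
            physical.push_back(std::stoi(item));
        }
    }
    return physical;
}

// Translate an application-visible device ordinal into the physical GPU
// index, or return -1 when the ordinal is out of range.
int physicalIndexFor(const std::vector<int> &visible, int app_device) {
    if (app_device < 0 ||
        static_cast<std::size_t>(app_device) >= visible.size()) {
        return -1;
    }
    return visible[static_cast<std::size_t>(app_device)];
}
```

With the value "1", application device 0 resolves to physical GPU 1, so a topology lookup keyed on the ordinal 0 would return the EFA devices of a different GPU.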

How?

Changed topology mapping from GPU ID to GPU-PCI bus ID:

  • Topology layer now uses GPU-PCI bus ID as key instead of GPU ID
  • gpu_to_efa_devices → pci_to_efa_devices
  • getEfaDevicesForGpu(int) → getEfaDevicesForGPUPci(string)
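
A minimal sketch of a topology map keyed by PCI bus ID follows. It is not the NIXL class: `TopologySketch` and `addGpu` are hypothetical names, and returning an empty list for an unknown ID is an assumption made for illustration. Only the member name `pci_to_efa_devices` and the method name `getEfaDevicesForGPUPci` come from the description above.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the topology layer (not the NIXL implementation).
class TopologySketch {
public:
    // Record which EFA devices are closest to the GPU at this PCI bus ID.
    void addGpu(const std::string &pci_bus_id,
                std::vector<std::string> efa_devices) {
        pci_to_efa_devices_[pci_bus_id] = std::move(efa_devices);
    }

    // Look up EFA devices by the GPU's PCI bus ID.
    // Assumption: an unknown ID yields an empty list so that the caller
    // can choose its own fallback.
    std::vector<std::string>
    getEfaDevicesForGPUPci(const std::string &pci_bus_id) const {
        const auto it = pci_to_efa_devices_.find(pci_bus_id);
        if (it == pci_to_efa_devices_.end()) {
            return {};
        }
        return it->second;
    }

private:
    std::unordered_map<std::string, std::vector<std::string>> pci_to_efa_devices_;
};
```

Because the key is a hardware address rather than an enumeration index, the lookup gives the same answer regardless of how the process renumbers its devices.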

Backend queries PCI bus ID from memory address:

  • Extended cudaQueryAddr() to return the GPU's PCI bus ID via cuDeviceGetPCIBusId()
  • Query happens after setting CUDA context in registerMem()
  • PCI bus ID passed to rail_manager for topology lookup
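
A PCI bus ID is a string of the form domain:bus:device.function in hexadecimal. When such strings are used as map keys, IDs obtained from different sources may differ in letter case or domain width. Whether this change normalizes them is not stated in this description, so the sketch below is only an assumption about one way to make keys comparable. `normalizePciBusId` is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Normalize "[domain]:[bus]:[device].[function]" (hexadecimal fields) to
// lowercase with a 4-digit domain, 2-digit bus and 2-digit device.
// Returns an empty string when the input does not parse completely.
std::string normalizePciBusId(const std::string &raw) {
    unsigned domain = 0, bus = 0, device = 0, function = 0;
    int consumed = 0;
    const int fields = std::sscanf(raw.c_str(), "%x:%x:%x.%x%n",
                                   &domain, &bus, &device, &function,
                                   &consumed);
    if (fields != 4 || static_cast<std::size_t>(consumed) != raw.size()) {
        return "";
    }
    char buf[48];
    std::snprintf(buf, sizeof(buf), "%04x:%02x:%02x.%x",
                  domain, bus, device, function);
    return buf;
}
```

Under this scheme, "00000000:3B:00.0" and "0000:3b:00.0" both map to "0000:3b:00.0".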

Rail manager uses PCI bus ID for rail selection:

  • Updated registerMemory() and selectRailsForMemory() to accept PCI bus ID
  • Calls topology->getEfaDevicesForGPUPci(pci_bus_id) instead of GPU ID lookup
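
The rail selection step can be sketched as follows. This is an illustration under stated assumptions, not the rail manager's code: `selectRailsForMemorySketch` and `PciToEfaMap` are hypothetical names, rails are modeled as one EFA device name per rail index, and falling back to all rails when no mapping is found is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using PciToEfaMap = std::unordered_map<std::string, std::vector<std::string>>;

// Return the indices of the rails whose EFA device is mapped to the GPU
// at pci_bus_id. rail_devices[i] is the EFA device name used by rail i.
std::vector<std::size_t> selectRailsForMemorySketch(
    const PciToEfaMap &pci_to_efa_devices,
    const std::vector<std::string> &rail_devices,
    const std::string &pci_bus_id) {
    std::vector<std::size_t> selected;
    const auto it = pci_to_efa_devices.find(pci_bus_id);
    if (it != pci_to_efa_devices.end()) {
        const std::vector<std::string> &nearby = it->second;
        for (std::size_t i = 0; i < rail_devices.size(); ++i) {
            if (std::find(nearby.begin(), nearby.end(), rail_devices[i]) !=
                nearby.end()) {
                selected.push_back(i);
            }
        }
    }
    // Assumption: with no topology match, use every rail rather than none.
    if (selected.empty()) {
        for (std::size_t i = 0; i < rail_devices.size(); ++i) {
            selected.push_back(i);
        }
    }
    return selected;
}
```

The caller passes the PCI bus ID obtained for the registered buffer, so the selection no longer depends on the device ordinal seen by the application.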

Benefits:

  • Works correctly with CUDA_VISIBLE_DEVICES - GPU-PCI bus ID is stable regardless of device renumbering
  • More robust - uses physical hardware identifier instead of application-visible device ID

copy-pr-bot bot commented Dec 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi akkart-aws! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

fengjica previously approved these changes Dec 23, 2025
amitrad-aws previously approved these changes Dec 24, 2025
fengjica previously approved these changes Dec 26, 2025
fengjica previously approved these changes Dec 31, 2025

ovidiusm (Contributor) commented Jan 5, 2026

/build

ovidiusm (Contributor) commented Jan 5, 2026

/ok to test ab61957

akkart-aws force-pushed the fix_gpu_id branch 2 times, most recently from 3fe6bfe to 09ddd26, on January 5, 2026 19:07

ovidiusm (Contributor) commented Jan 5, 2026

/build

ovidiusm (Contributor) commented Jan 5, 2026

/ok to test 3e21e6a

Fix incorrect EFA device selection when CUDA_VISIBLE_DEVICES is set by using
PCI bus IDs instead of enumeration order. Query physical GPU via cuPointerGetAttributes(),
map to hwloc topology index, and select correct EFA devices based on PCIe proximity.

Fixes GPU device ID mismatch between CUDA and hwloc enumeration that caused
wrong EFA rails to be selected in vLLM and multi-GPU workloads.

ovidiusm (Contributor) commented Jan 6, 2026

/build

ovidiusm (Contributor) commented Jan 6, 2026

/ok to test c596b28

ovidiusm (Contributor) commented Jan 8, 2026

/build

ovidiusm (Contributor) commented Jan 8, 2026

/ok to test c596b28

ovidiusm (Contributor) commented Jan 8, 2026

/build

ovidiusm (Contributor) commented Jan 8, 2026

/ok to test c596b28

akkart-aws merged commit 7233020 into ai-dynamo:main on Jan 8, 2026
17 of 18 checks passed
ovidiusm pushed a commit to ovidiusm/nixl that referenced this pull request Jan 8, 2026
mcuiaws pushed a commit to mcuiaws/nixl that referenced this pull request Jan 8, 2026
akkart-aws deleted the fix_gpu_id branch on January 9, 2026 19:05