Problem Description
In a mixed-GPU system, every device returned by jax.devices() reports the marketing name (device_kind) of device 0. More critically, this appears to affect compilation: on devices other than device 0, XLA compiles kernels for device 0's architecture instead of the device's own.
Setup: gfx906 (AMD Radeon VII) as device 0, gfx908 (AMD Instinct MI100) as device 1.
Both devices visible:
>>> jax.devices()
[RocmDevice(id=0), RocmDevice(id=1)]
>>> [d.device_kind for d in jax.devices()]
['AMD Radeon VII', 'AMD Radeon VII'] # device 1 is actually MI100
Each device isolated:
# ROCR_VISIBLE_DEVICES=0
>>> jax.devices()[0].device_kind
'AMD Radeon VII' # correct
# ROCR_VISIBLE_DEVICES=1
>>> jax.devices()[0].device_kind
'AMD Instinct MI100' # correct
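The comparison above can be automated with a small check. This is only a sketch: `reported_kinds` and `kind_mismatches` are hypothetical helper names, and the isolated names must still be collected separately (one run per ROCR_VISIBLE_DEVICES value), since a single process only sees one configuration.

```python
def reported_kinds():
    """Return device_kind for every device JAX sees in this process."""
    import jax  # imported lazily; requires a working JAX/ROCm install

    return [d.device_kind for d in jax.devices()]


def kind_mismatches(multi_device_kinds, isolated_kinds):
    """List (index, reported, expected) for devices whose multi-device
    name differs from the name reported when the device is isolated."""
    return [
        (i, reported, expected)
        for i, (reported, expected) in enumerate(
            zip(multi_device_kinds, isolated_kinds)
        )
        if reported != expected
    ]
```

With the values from the transcripts above, `kind_mismatches(['AMD Radeon VII', 'AMD Radeon VII'], ['AMD Radeon VII', 'AMD Instinct MI100'])` flags only device 1.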
The naming issue isn't just cosmetic. When placing a matmul on device 1 (MI100) in multi-device mode, XLA fails with:
error: unsupported target: 'gfx906'
JaxRuntimeError: INTERNAL: Autotuning failed ... No valid config found!
This suggests XLA is compiling gfx906 kernels for the MI100. The same matmul works perfectly when the MI100 is isolated with ROCR_VISIBLE_DEVICES=1.
Non-GEMM operations (elementwise, reductions, FFT) do work on both devices in multi-device mode, though they emit Triton messages such as error: unsupported target: 'gfx906', again suggesting that wrong-architecture compilation is being attempted.
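A minimal reproduction sketch for the matmul failure follows. It is not the exact script that produced the errors above: the helper names `try_matmul` and `summarize`, the float32 dtype, and the 1024 x 1024 size are assumptions chosen for illustration.

```python
def try_matmul(device_index, n=1024):
    """Run an n x n float32 matmul on one device; return (ok, detail)."""
    import jax  # imported lazily; requires a working JAX/ROCm install
    import jax.numpy as jnp
    import numpy as np

    device = jax.devices()[device_index]
    # Build the input on the host so no other GPU is touched first.
    x = jax.device_put(np.ones((n, n), dtype=np.float32), device)
    try:
        y = jnp.matmul(x, x)
        y.block_until_ready()  # force compilation and execution
        return True, device.device_kind
    except Exception as exc:  # e.g. JaxRuntimeError from autotuning
        return False, f"{type(exc).__name__}: {exc}"


def summarize(results):
    """Format one status line per device from a list of (ok, detail)."""
    return [
        f"device {i}: {'ok' if ok else 'FAILED'} ({detail})"
        for i, (ok, detail) in enumerate(results)
    ]
```

Calling `summarize([try_matmul(0), try_matmul(1)])` with both GPUs visible, and then `summarize([try_matmul(0)])` under ROCR_VISIBLE_DEVICES=1, should show whether the failure depends on the other GPU being present.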
Environment: ROCm 7.2.1 (rocm/dev-ubuntu-24.04:7.2.1-complete), JAX 0.9.1, jax-rocm7-plugin 0.9.1.post3, Ubuntu 22.04 host (kernel 6.8.0-87-generic).
Workaround: Use ROCR_VISIBLE_DEVICES to isolate a single GPU type per process.
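One way to apply the workaround from a single launcher is to start a separate Python process per GPU, each with its own ROCR_VISIBLE_DEVICES value. The sketch below uses only the standard library; `isolated_env` and `run_isolated` are hypothetical helper names, and the variable has to be set before the child process imports JAX.

```python
import os
import subprocess
import sys


def isolated_env(device_index, base_env=None):
    """Copy an environment and restrict it to a single ROCm device."""
    env = dict(os.environ if base_env is None else base_env)
    env["ROCR_VISIBLE_DEVICES"] = str(device_index)
    return env


def run_isolated(device_index, code):
    """Run a Python snippet in a child process that sees only one GPU."""
    return subprocess.run(
        [sys.executable, "-c", code],
        env=isolated_env(device_index),
        capture_output=True,
        text=True,
        check=False,
    )
```

For example, `run_isolated(1, "import jax; print(jax.devices()[0].device_kind)").stdout` runs the isolated check from the transcript above without changing the parent process's environment.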
Operating System
DOCKER:rocm/dev-ubuntu-24.04:7.2.1-complete
CPU
Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
GPU
MI100, Radeon VII
ROCm Version
7.2.1
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response