[Issue]: Multi-device setup reports device 0's name and architecture for all GPUs #390

@alexschroeter

Problem Description

In a mixed-GPU system, jax.devices() reports device 0's marketing name for every device. More critically, the problem appears to extend to compilation: XLA compiles kernels for device 0's architecture on the other devices.

Setup: gfx906 (AMD Radeon VII) as device 0, gfx908 (AMD Instinct MI100) as device 1.

Both devices visible:

>>> jax.devices()
[RocmDevice(id=0), RocmDevice(id=1)]
>>> [d.device_kind for d in jax.devices()]
['AMD Radeon VII', 'AMD Radeon VII']  # device 1 is actually MI100

Each device isolated:

# ROCR_VISIBLE_DEVICES=0
>>> jax.devices()[0].device_kind
'AMD Radeon VII'  # correct
# ROCR_VISIBLE_DEVICES=1
>>> jax.devices()[0].device_kind
'AMD Instinct MI100'  # correct
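
The comparison above can be scripted. The following is a minimal sketch, not part of the original report: it assumes `jax` with the ROCm plugin is installed in the same interpreter, and the helper names `probe_device_kinds` and `mismatched_devices` are hypothetical. Each probe runs in a fresh subprocess so that ROCR_VISIBLE_DEVICES is set before the runtime starts.

```python
import json
import os
import subprocess
import sys

# One-liner executed in a child process; prints device kinds as JSON.
PROBE = "import jax, json; print(json.dumps([d.device_kind for d in jax.devices()]))"


def probe_device_kinds(visible=None):
    """Return device_kind for every device seen by a fresh Python process.

    visible: value for ROCR_VISIBLE_DEVICES (e.g. "0"), or None to leave
    the environment unchanged (multi-device mode).
    """
    env = dict(os.environ)
    if visible is not None:
        env["ROCR_VISIBLE_DEVICES"] = visible
    out = subprocess.run(
        [sys.executable, "-c", PROBE],
        env=env, capture_output=True, text=True, check=True,
    )
    # Take the last stdout line in case earlier lines contain log output.
    return json.loads(out.stdout.strip().splitlines()[-1])


def mismatched_devices(multi, isolated):
    """Indices where the multi-device name differs from the isolated name."""
    return [i for i, (m, s) in enumerate(zip(multi, isolated)) if m != s]
```

Possible usage on a two-GPU machine: `multi = probe_device_kinds()`, then `isolated = [probe_device_kinds(str(i))[0] for i in range(len(multi))]`, then `mismatched_devices(multi, isolated)`. With the names reported above, the last call would flag index 1.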

The naming issue isn't just cosmetic. When placing a matmul on device 1 (MI100) in multi-device mode, XLA fails with:

error: unsupported target: 'gfx906'
JaxRuntimeError: INTERNAL: Autotuning failed ... No valid config found!

This suggests XLA is compiling gfx906 kernels for the MI100. The same matmul works perfectly when the MI100 is isolated with ROCR_VISIBLE_DEVICES=1.
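
A matmul placed on a specific device can be exercised with something like the sketch below. This is an illustration rather than the exact code that produced the error above; the function name `try_matmul` and the matrix size are assumptions. It uses `jax.device_put` to commit the input to the chosen device and catches any exception so the failure text can be pasted into the issue.

```python
def try_matmul(device_index, n=256):
    """Run an n x n float32 matmul on one device and report what happened."""
    result = {"device": device_index, "ok": False, "detail": ""}
    try:
        # Imported here so loading this file does not initialize the backend.
        import jax
        import jax.numpy as jnp

        dev = jax.devices()[device_index]
        # Commit the operand to the target device; the matmul then runs there.
        x = jax.device_put(jnp.ones((n, n), dtype=jnp.float32), dev)
        y = (x @ x).block_until_ready()
        result["ok"] = True
        result["detail"] = f"{dev.device_kind}: y[0, 0] = {float(y[0, 0])}"
    except Exception as exc:  # e.g. JaxRuntimeError from failed autotuning
        result["detail"] = f"{type(exc).__name__}: {exc}"
    return result
```

Calling `try_matmul(0)` and `try_matmul(1)` with both GPUs visible, and then `try_matmul(0)` again under ROCR_VISIBLE_DEVICES=1, should show whether the failure is specific to multi-device mode.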

Non-GEMM operations (elementwise, reductions, FFT) do work on both devices in multi-device mode, but they emit Triton messages such as error: unsupported target: 'gfx906', which again suggests that wrong-arch compilation is being attempted.

Environment: ROCm 7.2.1 (rocm/dev-ubuntu-24.04:7.2.1-complete), JAX 0.9.1, jax-rocm7-plugin 0.9.1.post3, Ubuntu 22.04 host (kernel 6.8.0-87-generic).

Workaround: Use ROCR_VISIBLE_DEVICES to isolate a single GPU type per process.
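
The workaround can also be applied from inside a Python script instead of the shell. The sketch below assumes the variable has to be in place before the ROCm runtime initializes, which is safest done before `jax` is imported at all; the helper name `isolate_gpu` is hypothetical.

```python
import os
import sys


def isolate_gpu(index):
    """Restrict this process to one GPU via ROCR_VISIBLE_DEVICES.

    Must be called before jax is imported; refuses to run otherwise,
    since changing the variable later may have no effect.
    """
    if "jax" in sys.modules:
        raise RuntimeError("call isolate_gpu() before importing jax")
    os.environ["ROCR_VISIBLE_DEVICES"] = str(index)
```

A worker script would call `isolate_gpu(1)` as its first statement and import `jax` afterwards, so `jax.devices()[0]` refers to the MI100 in that process. Running one such process per GPU type keeps each process on a single architecture.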

Operating System

DOCKER:rocm/dev-ubuntu-24.04:7.2.1-complete

CPU

Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz

GPU

MI100, Radeon VII

ROCm Version

7.2.1

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Metadata

Labels: status: triage (indicates an issue has been assigned for investigation)

Project status: In Progress
