Conversation

@XLC127 XLC127 commented Oct 31, 2025

Why are these changes needed?

Metax GPU is rapidly evolving and now supports inference for mainstream LLMs via vLLM. To scale these workloads across clusters, we propose leveraging Ray’s robust distributed task and resource management system.

@XLC127 XLC127 requested review from a team as code owners October 31, 2025 09:52
cursor[bot]

This comment was marked as outdated.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Metax GPUs to Ray. The changes are well-structured, including documentation, the MetaxGPUAcceleratorManager implementation, and corresponding tests. My review focuses on improving resource safety by ensuring proper cleanup, removing potentially incorrect logic copied from other accelerator managers, and enhancing the robustness and coverage of the new tests.

Comment on lines 44 to 50

```python
try:
    pymxsml.mxSmlExInit()
except pymxsml.MXSMLEXError:
    return 0
device_count = pymxsml.mxSmlExDeviceGetCount()
pymxsml.mxSmlExShutdown()
return device_count
```

high

The pymxsml.mxSmlExShutdown() call should be in a finally block to ensure that the pymxsml library is properly shut down, even if pymxsml.mxSmlExDeviceGetCount() raises an exception. This prevents potential resource leaks.

Suggested change

```diff
 try:
     pymxsml.mxSmlExInit()
 except pymxsml.MXSMLEXError:
     return 0
-device_count = pymxsml.mxSmlExDeviceGetCount()
-pymxsml.mxSmlExShutdown()
+try:
+    device_count = pymxsml.mxSmlExDeviceGetCount()
+finally:
+    pymxsml.mxSmlExShutdown()
 return device_count
```

Comment on lines 56 to 68

```python
try:
    pymxsml.mxSmlExInit()
except pymxsml.MXSMLEXError:
    return None
device_name = None
device_count = pymxsml.mxSmlExDeviceGetCount()
if device_count > 0:
    handle = pymxsml.mxSmlExDeviceGetHandleByIndex(0)
    device_name = pymxsml.mxSmlExDeviceGetName(handle)
    if isinstance(device_name, bytes):
        device_name = device_name.decode("utf-8")
pymxsml.mxSmlExShutdown()
return device_name
```

high

Similar to get_current_node_num_accelerators, the pymxsml.mxSmlExShutdown() call should be in a finally block to ensure that the pymxsml library is properly shut down, even if errors occur while getting device information. This prevents potential resource leaks.

```python
try:
    pymxsml.mxSmlExInit()
except pymxsml.MXSMLEXError:
    return None

try:
    device_name = None
    device_count = pymxsml.mxSmlExDeviceGetCount()
    if device_count > 0:
        handle = pymxsml.mxSmlExDeviceGetHandleByIndex(0)
        device_name = pymxsml.mxSmlExDeviceGetName(handle)
        if isinstance(device_name, bytes):
            device_name = device_name.decode("utf-8")
finally:
    pymxsml.mxSmlExShutdown()
return device_name
```

Comment on lines +23 to +35

```python
def test_metax_gpu_type(shutdown_only):
    with patch(
        "ray._private.accelerators.MetaxGPUAcceleratorManager.get_current_node_accelerator_type",
        return_value="MXC500",
    ):
        from ray.util import accelerators

        ray.init()
        result = MetaxGPUAcceleratorManager.get_current_node_accelerator_type()
        assert result == accelerators.METAX_C500
```

high

The test test_metax_gpu_type calls ray.init(), which triggers accelerator detection. This can fail in test environments that do not have Metax GPU hardware or drivers installed because get_current_node_num_accelerators is not mocked. Please mock get_current_node_num_accelerators to make the test more robust and independent of the environment.

Suggested change

```diff
-def test_metax_gpu_type(shutdown_only):
+@patch(
+    "ray._private.accelerators.MetaxGPUAcceleratorManager.get_current_node_num_accelerators",
+    return_value=1,
+)
+def test_metax_gpu_type(mock_get_num_accelerators, shutdown_only):
     with patch(
         "ray._private.accelerators.MetaxGPUAcceleratorManager.get_current_node_accelerator_type",
         return_value="MXC500",
     ):
         from ray.util import accelerators
         ray.init()
         result = MetaxGPUAcceleratorManager.get_current_node_accelerator_type()
         assert result == accelerators.METAX_C500
```


```rst
* - METAX GPU
  - GPU
  - Experimental, supported by the community
```

medium

There is a trailing whitespace at the end of this line. Please remove it to maintain consistent formatting.

Suggested change

```diff
-  - Experimental, supported by the community 
+  - Experimental, supported by the community
```

Comment on lines +35 to +36

```python
if cuda_visible_devices == "NoDevFiles":
    return []
```

medium

The check for cuda_visible_devices == "NoDevFiles" is a special case for NVIDIA GPUs when no devices are found. It's unlikely that Metax GPU drivers have the same behavior. This code seems to be copied from the NVIDIA accelerator manager and might be incorrect or dead code in this context. Please remove this check if it's not applicable to Metax GPUs.


```python
os.environ[
    MetaxGPUAcceleratorManager.get_visible_accelerator_ids_env_var()
] = ",".join([str(i) for i in visible_cuda_devices])
```

medium

The visible_cuda_devices parameter is already a List[str], so the str(i) conversion inside the list comprehension is redundant. You can directly join the list elements.

Suggested change

```diff
-] = ",".join([str(i) for i in visible_cuda_devices])
+] = ",".join(visible_cuda_devices)
```

Comment on lines +35 to +59

```python
def test_get_current_process_visible_accelerator_ids(monkeypatch):
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0"
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0,4,7")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0",
        "4",
        "7",
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "")
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == []
    )

    del os.environ["CUDA_VISIBLE_DEVICES"]
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() is None
    )
```

medium

To improve test coverage and validate the handling of all edge cases, please add a test case for when CUDA_VISIBLE_DEVICES is set to "NoDevFiles". This is especially important given that this behavior is likely specific to NVIDIA drivers and its applicability to Metax GPUs should be confirmed.

```python
def test_get_current_process_visible_accelerator_ids(monkeypatch):
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0"
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0,4,7")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0",
        "4",
        "7",
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "")
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == []
    )

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "NoDevFiles")
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == []
    )

    del os.environ["CUDA_VISIBLE_DEVICES"]
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() is None
    )
```


edoakes commented Oct 31, 2025

@jjyao PTAL

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core), llm, and community-contribution (Contributed by the community) labels Oct 31, 2025

jjyao commented Nov 3, 2025

Sorry for the late review. I'll take a look after the summit.

@XLC127 XLC127 force-pushed the support_metax branch 5 times, most recently from 3bdecdb to c2bec6a Compare November 3, 2025 09:20
```python
MetaxGPUAcceleratorManager.set_current_process_visible_accelerator_ids(
    ["0", "1", "7"]
)
assert os.environ["CUDA_VISIBLE_DEVICES"] == "0,1,7"
```

Bug: CUDA_VISIBLE_DEVICES leaks across tests - fix cleanup

test_set_current_process_visible_accelerator_ids sets CUDA_VISIBLE_DEVICES but never unsets it, causing the env var to leak across the test process and potentially affecting subsequent tests’ GPU detection and resource resolution. Add cleanup at the end (e.g., del os.environ["CUDA_VISIBLE_DEVICES"]) to prevent flakiness.

