Conversation

@XLC127 XLC127 commented Oct 31, 2025

Why are these changes needed?

Metax GPU is rapidly evolving and now supports inference for mainstream LLMs via vLLM. To scale these workloads across clusters, we propose leveraging Ray’s robust distributed task and resource management system.

@XLC127 XLC127 requested review from a team as code owners October 31, 2025 09:52
cursor[bot]

This comment was marked as outdated.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Metax GPUs to Ray. The changes are well-structured, including documentation, the MetaxGPUAcceleratorManager implementation, and corresponding tests. My review focuses on improving resource safety by ensuring proper cleanup, removing potentially incorrect logic copied from other accelerator managers, and enhancing the robustness and coverage of the new tests.

Comment on lines 44 to 50

```python
try:
    pymxsml.mxSmlExInit()
except pymxsml.MXSMLEXError:
    return 0
device_count = pymxsml.mxSmlExDeviceGetCount()
pymxsml.mxSmlExShutdown()
return device_count
```

high

The pymxsml.mxSmlExShutdown() call should be in a finally block to ensure that the pymxsml library is properly shut down, even if pymxsml.mxSmlExDeviceGetCount() raises an exception. This prevents potential resource leaks.

Suggested change

```diff
 try:
     pymxsml.mxSmlExInit()
 except pymxsml.MXSMLEXError:
     return 0
-device_count = pymxsml.mxSmlExDeviceGetCount()
-pymxsml.mxSmlExShutdown()
+try:
+    device_count = pymxsml.mxSmlExDeviceGetCount()
+finally:
+    pymxsml.mxSmlExShutdown()
 return device_count
```

Comment on lines 56 to 68

```python
try:
    pymxsml.mxSmlExInit()
except pymxsml.MXSMLEXError:
    return None
device_name = None
device_count = pymxsml.mxSmlExDeviceGetCount()
if device_count > 0:
    handle = pymxsml.mxSmlExDeviceGetHandleByIndex(0)
    device_name = pymxsml.mxSmlExDeviceGetName(handle)
    if isinstance(device_name, bytes):
        device_name = device_name.decode("utf-8")
pymxsml.mxSmlExShutdown()
return device_name
```

high

Similar to get_current_node_num_accelerators, the pymxsml.mxSmlExShutdown() call should be in a finally block to ensure that the pymxsml library is properly shut down, even if errors occur while getting device information. This prevents potential resource leaks.

```python
try:
    pymxsml.mxSmlExInit()
except pymxsml.MXSMLEXError:
    return None

try:
    device_name = None
    device_count = pymxsml.mxSmlExDeviceGetCount()
    if device_count > 0:
        handle = pymxsml.mxSmlExDeviceGetHandleByIndex(0)
        device_name = pymxsml.mxSmlExDeviceGetName(handle)
        if isinstance(device_name, bytes):
            device_name = device_name.decode("utf-8")
finally:
    pymxsml.mxSmlExShutdown()
return device_name
```

Comment on lines +23 to +35

```python
def test_metax_gpu_type(shutdown_only):
    with patch(
        "ray._private.accelerators.MetaxGPUAcceleratorManager.get_current_node_accelerator_type",
        return_value="MXC500",
    ):
        from ray.util import accelerators

        ray.init()
        result = MetaxGPUAcceleratorManager.get_current_node_accelerator_type()
        assert result == accelerators.METAX_C500
```

high

The test test_metax_gpu_type calls ray.init(), which triggers accelerator detection. This can fail in test environments that do not have Metax GPU hardware or drivers installed because get_current_node_num_accelerators is not mocked. Please mock get_current_node_num_accelerators to make the test more robust and independent of the environment.

Suggested change

```diff
-def test_metax_gpu_type(shutdown_only):
+@patch(
+    "ray._private.accelerators.MetaxGPUAcceleratorManager.get_current_node_num_accelerators",
+    return_value=1,
+)
+def test_metax_gpu_type(mock_get_num_accelerators, shutdown_only):
     with patch(
         "ray._private.accelerators.MetaxGPUAcceleratorManager.get_current_node_accelerator_type",
         return_value="MXC500",
     ):
         from ray.util import accelerators
         ray.init()
         result = MetaxGPUAcceleratorManager.get_current_node_accelerator_type()
         assert result == accelerators.METAX_C500
```


```rst
* - METAX GPU
  - GPU
  - Experimental, supported by the community
```

medium

There is a trailing whitespace at the end of this line. Please remove it to maintain consistent formatting.

Suggested change

```diff
-  - Experimental, supported by the community 
+  - Experimental, supported by the community
```

Comment on lines +35 to +36

```python
if cuda_visible_devices == "NoDevFiles":
    return []
```

medium

The check for cuda_visible_devices == "NoDevFiles" is a special case for NVIDIA GPUs when no devices are found. It's unlikely that Metax GPU drivers have the same behavior. This code seems to be copied from the NVIDIA accelerator manager and might be incorrect or dead code in this context. Please remove this check if it's not applicable to Metax GPUs.


```python
os.environ[
    MetaxGPUAcceleratorManager.get_visible_accelerator_ids_env_var()
] = ",".join([str(i) for i in visible_cuda_devices])
```

medium

The visible_cuda_devices parameter is already a List[str], so the str(i) conversion inside the list comprehension is redundant. You can directly join the list elements.

Suggested change

```diff
-] = ",".join([str(i) for i in visible_cuda_devices])
+] = ",".join(visible_cuda_devices)
```

Comment on lines +35 to +59

```python
def test_get_current_process_visible_accelerator_ids(monkeypatch):
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0"
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0,4,7")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0",
        "4",
        "7",
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "")
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == []
    )

    del os.environ["CUDA_VISIBLE_DEVICES"]
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() is None
    )
```

medium

To improve test coverage and validate the handling of all edge cases, please add a test case for when CUDA_VISIBLE_DEVICES is set to "NoDevFiles". This is especially important given that this behavior is likely specific to NVIDIA drivers and its applicability to Metax GPUs should be confirmed.

```python
def test_get_current_process_visible_accelerator_ids(monkeypatch):
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0"
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0,4,7")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0",
        "4",
        "7",
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "")
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == []
    )

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "NoDevFiles")
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == []
    )

    del os.environ["CUDA_VISIBLE_DEVICES"]
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() is None
    )
```


edoakes commented Oct 31, 2025

@jjyao PTAL

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core), llm, and community-contribution (Contributed by the community) labels Oct 31, 2025

jjyao commented Nov 3, 2025

Sorry for the late review. I'll take a look after the summit.

@XLC127 XLC127 force-pushed the support_metax branch 5 times, most recently from 3bdecdb to c2bec6a Compare November 3, 2025 09:20
```python
MetaxGPUAcceleratorManager.set_current_process_visible_accelerator_ids(
    ["0", "1", "7"]
)
assert os.environ["CUDA_VISIBLE_DEVICES"] == "0,1,7"
```

Bug: CUDA_VISIBLE_DEVICES leaks across tests - fix cleanup

test_set_current_process_visible_accelerator_ids sets CUDA_VISIBLE_DEVICES but never unsets it, causing the env var to leak across the test process and potentially affecting subsequent tests’ GPU detection and resource resolution. Add cleanup at the end (e.g., del os.environ["CUDA_VISIBLE_DEVICES"]) to prevent flakiness.

