add support for Metax GPU #58338
base: master
Conversation
Code Review
This pull request adds support for Metax GPUs to Ray. The changes are well-structured, including documentation, the MetaxGPUAcceleratorManager implementation, and corresponding tests. My review focuses on improving resource safety by ensuring proper cleanup, removing potentially incorrect logic copied from other accelerator managers, and enhancing the robustness and coverage of the new tests.
```python
try:
    pymxsml.mxSmlExInit()
except pymxsml.MXSMLEXError:
    return 0
device_count = pymxsml.mxSmlExDeviceGetCount()
pymxsml.mxSmlExShutdown()
return device_count
```
The pymxsml.mxSmlExShutdown() call should be in a finally block to ensure that the pymxsml library is properly shut down, even if pymxsml.mxSmlExDeviceGetCount() raises an exception. This prevents potential resource leaks.
Suggested change:

```python
try:
    pymxsml.mxSmlExInit()
except pymxsml.MXSMLEXError:
    return 0
try:
    device_count = pymxsml.mxSmlExDeviceGetCount()
finally:
    pymxsml.mxSmlExShutdown()
return device_count
```
```python
try:
    pymxsml.mxSmlExInit()
except pymxsml.MXSMLEXError:
    return None
device_name = None
device_count = pymxsml.mxSmlExDeviceGetCount()
if device_count > 0:
    handle = pymxsml.mxSmlExDeviceGetHandleByIndex(0)
    device_name = pymxsml.mxSmlExDeviceGetName(handle)
    if isinstance(device_name, bytes):
        device_name = device_name.decode("utf-8")
pymxsml.mxSmlExShutdown()
return device_name
```
Similar to get_current_node_num_accelerators, the pymxsml.mxSmlExShutdown() call should be in a finally block to ensure that the pymxsml library is properly shut down, even if errors occur while getting device information. This prevents potential resource leaks.
Suggested change:

```python
try:
    pymxsml.mxSmlExInit()
except pymxsml.MXSMLEXError:
    return None
try:
    device_name = None
    device_count = pymxsml.mxSmlExDeviceGetCount()
    if device_count > 0:
        handle = pymxsml.mxSmlExDeviceGetHandleByIndex(0)
        device_name = pymxsml.mxSmlExDeviceGetName(handle)
        if isinstance(device_name, bytes):
            device_name = device_name.decode("utf-8")
finally:
    pymxsml.mxSmlExShutdown()
return device_name
```

```python
def test_metax_gpu_type(shutdown_only):
    with patch(
        "ray._private.accelerators.MetaxGPUAcceleratorManager.get_current_node_accelerator_type",
        return_value="MXC500",
    ):
        from ray.util import accelerators

        ray.init()
        result = MetaxGPUAcceleratorManager.get_current_node_accelerator_type()
        assert result == accelerators.METAX_C500
```
The test test_metax_gpu_type calls ray.init(), which triggers accelerator detection. This can fail in test environments that do not have Metax GPU hardware or drivers installed because get_current_node_num_accelerators is not mocked. Please mock get_current_node_num_accelerators to make the test more robust and independent of the environment.
Suggested change:

```python
@patch(
    "ray._private.accelerators.MetaxGPUAcceleratorManager.get_current_node_num_accelerators",
    return_value=1,
)
def test_metax_gpu_type(mock_get_num_accelerators, shutdown_only):
    with patch(
        "ray._private.accelerators.MetaxGPUAcceleratorManager.get_current_node_accelerator_type",
        return_value="MXC500",
    ):
        from ray.util import accelerators

        ray.init()
        result = MetaxGPUAcceleratorManager.get_current_node_accelerator_type()
        assert result == accelerators.METAX_C500
```
```
* - METAX GPU
  - GPU
  - Experimental, supported by the community
```
```python
if cuda_visible_devices == "NoDevFiles":
    return []
```
The check for cuda_visible_devices == "NoDevFiles" is a special case for NVIDIA GPUs when no devices are found. It's unlikely that Metax GPU drivers have the same behavior. This code seems to be copied from the NVIDIA accelerator manager and might be incorrect or dead code in this context. Please remove this check if it's not applicable to Metax GPUs.
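If the Metax driver indeed never reports `"NoDevFiles"`, the parsing logic could drop that branch entirely. The following is a hypothetical simplified sketch (not the actual Ray implementation; the function name and env-var default are illustrative) of what the visibility parsing would look like without the NVIDIA-specific special case:

```python
import os


def get_visible_accelerator_ids(env_var="CUDA_VISIBLE_DEVICES"):
    # Hypothetical simplified parser without the NVIDIA-specific
    # "NoDevFiles" branch; assumes Metax drivers never set that value.
    visible = os.environ.get(env_var)
    if visible is None:
        return None  # env var unset: all devices are visible
    if visible == "":
        return []  # explicitly no devices
    return visible.split(",")
```

The three return shapes (`None`, `[]`, non-empty list) mirror the distinctions the tests below exercise.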
```python
os.environ[
    MetaxGPUAcceleratorManager.get_visible_accelerator_ids_env_var()
] = ",".join([str(i) for i in visible_cuda_devices])
```
```python
def test_get_current_process_visible_accelerator_ids(monkeypatch):
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0"
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0,4,7")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0",
        "4",
        "7",
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "")
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == []
    )

    del os.environ["CUDA_VISIBLE_DEVICES"]
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() is None
    )
```
To improve test coverage and validate the handling of all edge cases, please add a test case for when CUDA_VISIBLE_DEVICES is set to "NoDevFiles". This is especially important given that this behavior is likely specific to NVIDIA drivers and its applicability to Metax GPUs should be confirmed.
Suggested change:

```python
def test_get_current_process_visible_accelerator_ids(monkeypatch):
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0"
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0,4,7")
    assert MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
        "0",
        "4",
        "7",
    ]

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "")
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == []
    )

    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "NoDevFiles")
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() == []
    )

    del os.environ["CUDA_VISIBLE_DEVICES"]
    assert (
        MetaxGPUAcceleratorManager.get_current_process_visible_accelerator_ids() is None
    )
```
@jjyao PTAL
Sorry for the late review. I'll take a look after the summit. |
Force-pushed from 3bdecdb to c2bec6a.
```python
MetaxGPUAcceleratorManager.set_current_process_visible_accelerator_ids(
    ["0", "1", "7"]
)
assert os.environ["CUDA_VISIBLE_DEVICES"] == "0,1,7"
```
Bug: CUDA_VISIBLE_DEVICES leaks across tests - fix cleanup
test_set_current_process_visible_accelerator_ids sets CUDA_VISIBLE_DEVICES but never unsets it, causing the env var to leak across the test process and potentially affecting subsequent tests’ GPU detection and resource resolution. Add cleanup at the end (e.g., del os.environ["CUDA_VISIBLE_DEVICES"]) to prevent flakiness.
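One way to make that cleanup explicit is to save and restore the variable in a try/finally. This is a self-contained sketch, not the actual Ray test: `set_visible_accelerator_ids` here is a stand-in for the manager's setter, which (per the quoted test) writes `CUDA_VISIBLE_DEVICES` directly:

```python
import os

ENV_VAR = "CUDA_VISIBLE_DEVICES"


def set_visible_accelerator_ids(ids):
    # Stand-in for the manager's setter: joins the ids and writes the env var.
    os.environ[ENV_VAR] = ",".join(ids)


def test_set_current_process_visible_accelerator_ids():
    saved = os.environ.get(ENV_VAR)
    try:
        set_visible_accelerator_ids(["0", "1", "7"])
        assert os.environ[ENV_VAR] == "0,1,7"
    finally:
        # Restore the prior state so later tests see a clean environment.
        if saved is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = saved
```

Alternatively, routing all env-var writes through pytest's `monkeypatch.setenv` would give the same restoration automatically, but since the manager sets the variable internally, the explicit try/finally (or a dedicated fixture) is the simpler fix here.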
Signed-off-by: XLC127 <[email protected]>
Why are these changes needed?
Metax GPU is rapidly evolving and now supports inference for mainstream LLMs via vLLM. To scale these workloads across clusters, we propose leveraging Ray’s robust distributed task and resource management system.