DP: Fix for torch.compile #722
base: main
Conversation
Pull request overview
This PR implements fixes to enable torch.compile compatibility for data parallel operations on HPU devices. The main issue addressed is that stack-style all-gather operations are incompatible with torch.compile, requiring a switch to concat-style operations.
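For context, a minimal sketch of the two all-gather styles, using plain torch.distributed calls for illustration (the actual HPU communicator code in this PR may differ):

```python
import torch
import torch.distributed as dist

def all_gather_stack_style(x: torch.Tensor, world_size: int) -> torch.Tensor:
    # Stack-style: gather into a Python list, then stack along a new dim.
    # The intermediate tensor list is hard for torch.compile to trace.
    chunks = [torch.empty_like(x) for _ in range(world_size)]
    dist.all_gather(chunks, x)
    return torch.stack(chunks, dim=0)  # shape: (world_size, *x.shape)

def all_gather_concat_style(x: torch.Tensor, world_size: int) -> torch.Tensor:
    # Concat-style: gather into one preallocated tensor along dim 0;
    # a single collective op that torch.compile can capture in the graph.
    out = torch.empty((world_size * x.shape[0], *x.shape[1:]),
                      dtype=x.dtype, device=x.device)
    dist.all_gather_into_tensor(out, x)
    return out
```

The concat result is the stack result with its first two dimensions flattened, so callers switch from indexing `out[rank]` to slicing `out[rank * n : (rank + 1) * n]`.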
Key Changes:
- Modified all-gather operations to use concat-style instead of stack-style for torch.compile compatibility
- Added padding logic to ensure token counts are divisible by tensor parallel size for sequence parallel MoE (see the sketch after this list)
- Added null checks for DP metadata access to prevent potential errors during compilation
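The padding rule amounts to rounding the token count up to the next multiple of tp_size; a short sketch (the function name is hypothetical, not from the PR):

```python
def pad_to_tp_multiple(num_tokens: int, tp_size: int) -> int:
    # Round num_tokens up so the token dimension splits evenly across TP ranks.
    return (num_tokens + tp_size - 1) // tp_size * tp_size

assert pad_to_tp_multiple(10, 4) == 12  # 2 padding tokens added
assert pad_to_tp_multiple(8, 4) == 8    # already divisible, unchanged
```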
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_dp_utils.py | Adds padding to ensure num_tokens is divisible by tp_size for sequence parallel MoE |
| vllm_gaudi/ops/hpu_fused_moe.py | Adds null-safety checks when accessing DP metadata fields |
| vllm_gaudi/ops/hpu_fp8.py | Adds null-safety checks when accessing DP metadata fields |
| vllm_gaudi/distributed/device_communicators/hpu_communicator.py | Converts all-gather from stack-style to concat-style for torch.compile compatibility |
| tests/full_tests/ci_tests.sh | Enables and adds tests for DP2xTP2 configuration with both lazy mode and torch.compile mode |
vllm_gaudi/ops/hpu_fused_moe.py
Outdated

```python
dp_metadata = get_hpu_dp_metadata()
hidden_states_across_dp = dp_metadata.hidden_states_across_dp if dp_metadata is not None else None
```

Copilot (AI) · Dec 16, 2025
The code retrieves dp_metadata once but then checks if it's None on every subsequent attribute access. If dp_metadata is None on line 60, all three conditional expressions (lines 61, 64, 67) will set their variables to None, which could cause dispatch_tensor to fail since it may not handle None for the output parameter correctly. Consider adding an early check after line 60 or ensuring that dispatch_tensor properly handles None values.
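One way to realize the suggested early check, as a sketch only (the fallback helper is hypothetical, not part of this PR):

```python
dp_metadata = get_hpu_dp_metadata()
if dp_metadata is None:
    # No DP metadata (e.g. running without data parallelism): take a
    # local-only path instead of threading None outputs into dispatch_tensor.
    return forward_without_dp(hidden_states)  # hypothetical fallback
hidden_states_across_dp = dp_metadata.hidden_states_across_dp
```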
vllm_gaudi/ops/hpu_fp8.py
Outdated

```python
dp_metadata = get_hpu_dp_metadata()
hidden_states_across_dp = dp_metadata.hidden_states_across_dp if dp_metadata is not None else None
```

Copilot (AI) · Dec 16, 2025
Same issue as in hpu_fused_moe.py: if dp_metadata is None, all three variables will be set to None, potentially causing issues in dispatch_tensor. The None handling should be consistent with the function's expectations or an early guard should be added.
tests/full_tests/ci_tests.sh

```bash
echo HABANA_VISIBLE_DEVICES=all VLLM_SKIP_WARMUP=true PT_HPU_LAZY_MODE=0 python -u vllm-gaudi/examples/data_parallel.py --dp-size 2 --tp-size 2
HABANA_VISIBLE_DEVICES=all VLLM_SKIP_WARMUP=true PT_HPU_LAZY_MODE=0 python -u vllm-gaudi/examples/data_parallel.py --dp-size 2 --tp-size 2
if [ $? -ne 0 ]; then
    echo "Error: Test failed for data parallel size 2 + tensor parallel size 2" >&2
```
Copilot (AI) · Dec 16, 2025
The error message is ambiguous as it doesn't distinguish between the lazy mode test and the torch.compile mode test. The message should specify 'torch.compile mode' to match the test being run.
| echo "Error: Test failed for data parallel size 2 + tensor parallel size 2" >&2 | |
| echo "Error: Test failed for data parallel size 2 + tensor parallel size 2 (torch.compile mode)" >&2 |
Force-pushed from b4b9690 to 2076a64
vllm_gaudi/ops/hpu_fused_moe.py
Outdated

```diff
 if layer.dp_size > 1:
-    hidden_states_across_dp = get_hpu_dp_metadata().hidden_states_across_dp
+    dp_metadata = get_hpu_dp_metadata()
+    hidden_states_across_dp = dp_metadata.hidden_states_across_dp if dp_metadata is not None else None
```
if dp_size > 1, get_hpu_dp_metadata() should not return None; instead of checking for None, please add an assert here
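Roughly what is being asked for (exact placement within hpu_fused_moe.py assumed):

```python
if layer.dp_size > 1:
    dp_metadata = get_hpu_dp_metadata()
    # Fail fast: metadata must exist whenever data parallelism is active.
    assert dp_metadata is not None, \
        "HPU DP metadata must be initialized when dp_size > 1"
    hidden_states_across_dp = dp_metadata.hidden_states_across_dp
```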
It's only available under lazy + HPU graph mode, since it can reuse the memory.
✅ CI Passed: all checks passed successfully against the following vllm commit:
🚧 CI Blocked: the main CI workflow was not started for the following reason:
Force-pushed from 9f163f4 to 9639ca5
Signed-off-by: Xinyu Chen <[email protected]>
Force-pushed from 9639ca5 to 30289bf
✅ CI Passed: all checks passed successfully against the following vllm commit: