Optimize MoE via chunk settings #658
base: main
Conversation
Pull request overview
This PR adds optimization support for Mixture of Experts (MoE) operations through configurable chunk settings. The changes introduce a new `moe_chunk` configuration option and implement dynamic chunk sizing based on token count to improve MoE performance.
Key Changes:
- Added a `global_num_experts` parameter to all MoE operator classes to support chunk configuration
- Implemented dynamic chunk size selection based on token count (64 for ≤1536 tokens, 256 for 1536-4096 tokens, 512 for >4096 tokens)
- Added a new `moe_chunk` feature flag controlled by the `VLLM_MOE_CHUNK` environment variable (see the sketch after this list)
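For illustration, a minimal sketch of how such an environment-variable-backed flag could be wired up. The helper name and the set of accepted values are assumptions; the real wiring in `vllm_gaudi/extension/features.py` is not shown in this PR:

```python
import os

# Minimal sketch of an environment-variable-backed feature flag, assuming a
# plain os.environ lookup. Accepted truthy values here are an assumption.
def moe_chunk_enabled() -> bool:
    return os.environ.get("VLLM_MOE_CHUNK", "0").lower() in ("1", "true")
```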
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_gaudi/extension/ops.py | Added global_num_experts parameter and _get_extra_kwargs method to MoE operator classes; implemented dynamic chunk sizing logic |
| vllm_gaudi/extension/features.py | Added moe_chunk configuration option with environment variable support |
| vllm_gaudi/ops/hpu_fused_moe.py | Updated VllmMixtureOfExpertsOp instantiation to pass global_num_experts |
| vllm_gaudi/ops/hpu_fp8.py | Updated FP8 MoE operator instantiations to pass global_num_experts |
| vllm_gaudi/ops/hpu_compressed_tensors.py | Updated compressed tensors MoE operator instantiation to pass global_num_experts |
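As an illustration of the call-site changes listed above, a hedged sketch of the updated instantiation in `vllm_gaudi/ops/hpu_fused_moe.py`. Every argument other than `global_num_experts` is an assumption, since the constructor signature is not shown here:

```python
# Sketch only: num_experts is an assumed pre-existing argument;
# global_num_experts is the new parameter this PR threads through.
moe_op = VllmMixtureOfExpertsOp(
    num_experts=num_local_experts,
    global_num_experts=global_num_experts,
)
```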
```python
def _get_extra_kwargs(self, tokens_num: int):
    if self.enable_moe_chunk:
        if tokens_num <= 1536:
            chunk_size = 64
        elif tokens_num > 1536 and tokens_num <= 4096:
            chunk_size = 256
        else:
            chunk_size = 512
        kwargs = {
            "chunk_size": chunk_size,
            "total_experts": self.global_num_experts,
        }
    else:
        kwargs = {}
    return kwargs
```
Copilot AI commented on Nov 28, 2025:
The `_get_extra_kwargs` method is duplicated identically across three classes (`VllmMixtureOfExpertsOp`, `VllmMixtureOfExpertsOpFP8`, and `VllmMixtureOfExpertsOpFP8PerChannel`). Consider extracting this logic into a shared base class or mixin to reduce code duplication and improve maintainability.
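For reference, a minimal sketch of the extraction Copilot suggests. The mixin name is a hypothetical placeholder, and the class-level annotations simply document the attributes the method reads; the thresholds mirror the diff under review:

```python
class MoeChunkKwargsMixin:
    """Shared chunk-kwargs logic (hypothetical name, not from this PR)."""
    enable_moe_chunk: bool
    global_num_experts: int

    def _get_extra_kwargs(self, tokens_num: int) -> dict:
        if not self.enable_moe_chunk:
            return {}
        # Threshold values copied from the PR's diff.
        if tokens_num <= 1536:
            chunk_size = 64
        elif tokens_num <= 4096:
            chunk_size = 256
        else:
            chunk_size = 512
        return {
            "chunk_size": chunk_size,
            "total_experts": self.global_num_experts,
        }
```

The three operator classes would then inherit from this mixin instead of each carrying its own copy of the method.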
```python
        if tokens_num <= 1536:
            chunk_size = 64
        elif tokens_num > 1536 and tokens_num <= 4096:
            chunk_size = 256
        else:
            chunk_size = 512
```
Copilot AI commented on Nov 28, 2025:
The magic numbers for token thresholds (1536, 4096) and chunk sizes (64, 256, 512) should be defined as named constants at the module or class level to improve code maintainability and make it easier to tune these values.
Suggested change:

```diff
-        if tokens_num <= 1536:
-            chunk_size = 64
-        elif tokens_num > 1536 and tokens_num <= 4096:
-            chunk_size = 256
-        else:
-            chunk_size = 512
+        if tokens_num <= TOKEN_THRESHOLD_1:
+            chunk_size = CHUNK_SIZE_SMALL
+        elif tokens_num > TOKEN_THRESHOLD_1 and tokens_num <= TOKEN_THRESHOLD_2:
+            chunk_size = CHUNK_SIZE_MEDIUM
+        else:
+            chunk_size = CHUNK_SIZE_LARGE
```
```python
    if self.enable_moe_chunk:
        if tokens_num <= 1536:
            chunk_size = 64
        elif tokens_num > 1536 and tokens_num <= 4096:
```
Copilot AI commented on Nov 28, 2025:
The condition `tokens_num > 1536` is redundant in the `elif` clause since the previous `if` statement already handles `tokens_num <= 1536`. Simplify to `elif tokens_num <= 4096:`.
Suggested change:

```diff
-        elif tokens_num > 1536 and tokens_num <= 4096:
+        elif tokens_num <= 4096:
```
@yiliu30 @ranzhejiang please help review.
Signed-off-by: Xinyu Chen <[email protected]>
✅ CI Passed. All checks passed successfully against the following vllm commit:
```python
            chunk_size = 256
        else:
            chunk_size = 512
        kwargs = {
```
As suggested by Copilot, it would be better to define named constants for these magic numbers.
Also, please add some unit tests.
Others LGTM.
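For illustration, a hedged sketch of unit tests like those requested above. The import path follows the file list in this PR, but the constructor signature is not shown here, so the test bypasses `__init__` and sets only the two attributes `_get_extra_kwargs` reads; that bypass is an assumption for brevity:

```python
import unittest

from vllm_gaudi.extension.ops import VllmMixtureOfExpertsOp


class TestGetExtraKwargs(unittest.TestCase):
    def setUp(self):
        # Bypass __init__ (its signature is not shown in this PR) and set
        # only the attributes that _get_extra_kwargs reads.
        self.op = VllmMixtureOfExpertsOp.__new__(VllmMixtureOfExpertsOp)
        self.op.enable_moe_chunk = True
        self.op.global_num_experts = 8

    def test_chunk_size_boundaries(self):
        # Thresholds mirror the diff: <=1536 -> 64, <=4096 -> 256, else 512.
        self.assertEqual(self.op._get_extra_kwargs(1536)["chunk_size"], 64)
        self.assertEqual(self.op._get_extra_kwargs(1537)["chunk_size"], 256)
        self.assertEqual(self.op._get_extra_kwargs(4096)["chunk_size"], 256)
        self.assertEqual(self.op._get_extra_kwargs(4097)["chunk_size"], 512)

    def test_total_experts_passthrough(self):
        self.assertEqual(self.op._get_extra_kwargs(100)["total_experts"], 8)

    def test_disabled_returns_empty(self):
        self.op.enable_moe_chunk = False
        self.assertEqual(self.op._get_extra_kwargs(1024), {})
```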
Agree with liuyi's comments, LGTM
Add a new feature flag, `VLLM_MOE_CHUNK`. With this enabled, `chunk_size` and `global_num_experts` will be passed to `torch.ops.hpu.mixture_of_experts` for better performance.
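For context, a minimal sketch of how those values could be forwarded at the call site. The op's positional arguments are collapsed into `moe_args` because its full signature is not shown in this PR, and deriving the token count from `hidden_states.shape[0]` is an assumption:

```python
# Sketch only: moe_args stands in for the op's real positional arguments.
tokens_num = hidden_states.shape[0]  # assumed token-count source
extra_kwargs = self._get_extra_kwargs(tokens_num)
output = torch.ops.hpu.mixture_of_experts(*moe_args, **extra_kwargs)
```

When the `moe_chunk` feature is disabled, `extra_kwargs` is empty and the call is unchanged, so the flag is safe to roll out behind the environment variable.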