
Conversation

@xinyu-intel (Contributor) commented Nov 28, 2025

Add a new feature flag, VLLM_MOE_CHUNK. When it is enabled, chunk_size and global_num_experts are passed to torch.ops.hpu.mixture_of_experts for better performance.

Copilot AI left a comment

Pull request overview

This PR adds optimization support for Mixture of Experts (MoE) operations through configurable chunk settings. The changes introduce a new moe_chunk configuration option and implement dynamic chunk sizing based on token count to improve MoE performance.

Key Changes:

  • Added global_num_experts parameter to all MoE operator classes to support chunk configuration
  • Implemented dynamic chunk size selection based on token count (64 for ≤1536 tokens, 256 for 1536-4096 tokens, 512 for >4096 tokens)
  • Added new moe_chunk feature flag controlled by VLLM_MOE_CHUNK environment variable
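The environment-variable gate described above can be sketched as follows. The helper name and accepted truthy values here are illustrative assumptions; the actual wiring lives in vllm_gaudi/extension/features.py.

```python
import os

def moe_chunk_enabled() -> bool:
    # Hypothetical helper: returns True when VLLM_MOE_CHUNK is set to a
    # common truthy string. The real feature flag is read by the features
    # module in vllm_gaudi, not by this standalone function.
    return os.environ.get("VLLM_MOE_CHUNK", "0").lower() in ("1", "true", "yes")
```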

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `vllm_gaudi/extension/ops.py` | Added `global_num_experts` parameter and `_get_extra_kwargs` method to MoE operator classes; implemented dynamic chunk sizing logic |
| `vllm_gaudi/extension/features.py` | Added `moe_chunk` configuration option with environment variable support |
| `vllm_gaudi/ops/hpu_fused_moe.py` | Updated `VllmMixtureOfExpertsOp` instantiation to pass `global_num_experts` |
| `vllm_gaudi/ops/hpu_fp8.py` | Updated FP8 MoE operator instantiations to pass `global_num_experts` |
| `vllm_gaudi/ops/hpu_compressed_tensors.py` | Updated compressed tensors MoE operator instantiation to pass `global_num_experts` |


Comment on lines +958 to +972
```python
def _get_extra_kwargs(self, tokens_num: int):
    if self.enable_moe_chunk:
        if tokens_num <= 1536:
            chunk_size = 64
        elif tokens_num > 1536 and tokens_num <= 4096:
            chunk_size = 256
        else:
            chunk_size = 512
        kwargs = {
            "chunk_size": chunk_size,
            "total_experts": self.global_num_experts,
        }
    else:
        kwargs = {}
    return kwargs
```

Copilot AI Nov 28, 2025


The _get_extra_kwargs method is duplicated identically across three classes (VllmMixtureOfExpertsOp, VllmMixtureOfExpertsOpFP8, and VllmMixtureOfExpertsOpFP8PerChannel). Consider extracting this logic into a shared base class or mixin to reduce code duplication and improve maintainability.
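One way to act on this suggestion is a small mixin shared by the three operator classes. This is a sketch, not the PR's actual code; the attribute names mirror the PR (`enable_moe_chunk`, `global_num_experts`), but the class layout is an assumption.

```python
# Hypothetical mixin extracting the duplicated chunk-size logic so the three
# MoE operator classes can share a single implementation.
class MoeChunkKwargsMixin:
    enable_moe_chunk: bool = False
    global_num_experts: int = 0

    def _get_extra_kwargs(self, tokens_num: int) -> dict:
        if not self.enable_moe_chunk:
            return {}
        if tokens_num <= 1536:
            chunk_size = 64
        elif tokens_num <= 4096:
            chunk_size = 256
        else:
            chunk_size = 512
        return {"chunk_size": chunk_size,
                "total_experts": self.global_num_experts}
```

Each operator class would then inherit from the mixin instead of carrying its own copy of the method.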

Comment on lines +504 to +509
```python
if tokens_num <= 1536:
    chunk_size = 64
elif tokens_num > 1536 and tokens_num <= 4096:
    chunk_size = 256
else:
    chunk_size = 512
```

Copilot AI Nov 28, 2025


The magic numbers for token thresholds (1536, 4096) and chunk sizes (64, 256, 512) should be defined as named constants at the module or class level to improve code maintainability and make it easier to tune these values.

Suggested change
```diff
-if tokens_num <= 1536:
-    chunk_size = 64
-elif tokens_num > 1536 and tokens_num <= 4096:
-    chunk_size = 256
-else:
-    chunk_size = 512
+if tokens_num <= TOKEN_THRESHOLD_1:
+    chunk_size = CHUNK_SIZE_SMALL
+elif tokens_num > TOKEN_THRESHOLD_1 and tokens_num <= TOKEN_THRESHOLD_2:
+    chunk_size = CHUNK_SIZE_MEDIUM
+else:
+    chunk_size = CHUNK_SIZE_LARGE
```
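Completing the suggestion, the constants might be defined once at module level. The constant names below follow Copilot's suggested change but are otherwise illustrative; the values are the ones from the PR.

```python
# Illustrative module-level constants for the MoE chunking heuristic.
TOKEN_THRESHOLD_1 = 1536
TOKEN_THRESHOLD_2 = 4096
CHUNK_SIZE_SMALL = 64
CHUNK_SIZE_MEDIUM = 256
CHUNK_SIZE_LARGE = 512

def select_chunk_size(tokens_num: int) -> int:
    # Map a token count to a chunk size using the named thresholds.
    if tokens_num <= TOKEN_THRESHOLD_1:
        return CHUNK_SIZE_SMALL
    if tokens_num <= TOKEN_THRESHOLD_2:
        return CHUNK_SIZE_MEDIUM
    return CHUNK_SIZE_LARGE
```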

```python
if self.enable_moe_chunk:
    if tokens_num <= 1536:
        chunk_size = 64
    elif tokens_num > 1536 and tokens_num <= 4096:
```

Copilot AI Nov 28, 2025


The condition `tokens_num > 1536` is redundant in the elif clause since the previous if statement already handles `tokens_num <= 1536`. Simplify to `elif tokens_num <= 4096:`.

Suggested change
```diff
-elif tokens_num > 1536 and tokens_num <= 4096:
+elif tokens_num <= 4096:
```

@xinyu-intel
Contributor Author

@yiliu30 @ranzhejiang please help review.

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
0353d2e162cbda776d9dbfe026e65303204a7f1f

```python
            chunk_size = 256
        else:
            chunk_size = 512
        kwargs = {
```
Contributor


As suggested by Copilot, it would be better to define names for these magic numbers.
Also, please add some unit tests.
Others LGTM.
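A minimal unit test for the threshold boundaries might look like this. The helper below is a stand-in reproduced so the test is self-contained; in the PR the real logic lives in `_get_extra_kwargs` on the operator classes, and the test class name is hypothetical.

```python
import unittest

def get_moe_chunk_size(tokens_num: int) -> int:
    # Stand-in for the PR's chunk-size selection (see _get_extra_kwargs).
    if tokens_num <= 1536:
        return 64
    if tokens_num <= 4096:
        return 256
    return 512

class TestMoeChunkSize(unittest.TestCase):
    def test_boundaries(self):
        self.assertEqual(get_moe_chunk_size(1), 64)
        self.assertEqual(get_moe_chunk_size(1536), 64)   # inclusive upper bound
        self.assertEqual(get_moe_chunk_size(1537), 256)
        self.assertEqual(get_moe_chunk_size(4096), 256)  # inclusive upper bound
        self.assertEqual(get_moe_chunk_size(4097), 512)
```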

@ranzhejiang

Agree with liuyi's comments, LGTM

