[Core] Add GPT-OSS model support for Ascend NPU #2421
Conversation
Code Review
This pull request introduces support for the GPT-OSS model on Ascend NPU, a significant enhancement for the vLLM Ascend framework. The implementation is comprehensive, covering the model configuration, an attention mechanism with sliding window and YARN RoPE scaling, and a Mixture-of-Experts layer built on `AscendFusedMoE`. The code is well-structured and aligns with vLLM's architectural patterns. My review identified a few unused parameters and attributes within the new model implementation; addressing these will improve clarity, reduce the potential for bugs, and enhance maintainability.
vllm_ascend/models/gpt_oss.py
Outdated
```python
# Sink attention weights for streaming attention
self.sinks = nn.Parameter(
    torch.zeros(self.num_heads, dtype=torch.float32)
)
```
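The `sinks` parameter above is flagged by the review as unused. For context, a common way such per-head sink logits are consumed is as an extra attention column that absorbs probability mass without contributing a value vector. The sketch below illustrates that pattern only; it is an assumption for illustration, not code from this PR:

```python
import torch

def sink_softmax(scores: torch.Tensor, sinks: torch.Tensor) -> torch.Tensor:
    """Softmax over attention scores augmented with a per-head sink logit.

    scores: (num_heads, q_len, k_len) attention logits
    sinks:  (num_heads,) learnable sink logits

    The sink column is appended before normalization and dropped after,
    so it soaks up probability mass but attends to nothing.
    """
    num_heads, q_len, _ = scores.shape
    sink_col = sinks.view(num_heads, 1, 1).expand(num_heads, q_len, 1)
    probs = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)
    return probs[..., :-1]  # drop the sink column; rows now sum to <= 1
```

With uniform zero logits over 4 keys plus the sink, each row sums to 4/5 rather than 1, the remaining mass having gone to the sink.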
vllm_ascend/models/gpt_oss.py
Outdated
```python
# Custom swiglu activation with limit
self.swiglu_limit = config.swiglu_limit
```
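The `swiglu_limit` attribute suggests a clamped SwiGLU variant, where pre-activations are bounded before gating. The sketch below shows one plausible reading; the exact clamping scheme is an assumption, not taken from the PR:

```python
import torch
import torch.nn.functional as F

def clamped_swiglu(gate: torch.Tensor, up: torch.Tensor, limit: float) -> torch.Tensor:
    # Clamp pre-activations so extreme values cannot saturate the gate.
    # (Illustrative: the real clamping bounds are model-specific.)
    gate = gate.clamp(max=limit)
    up = up.clamp(min=-limit, max=limit)
    return F.silu(gate) * up
```

Bounding the inputs this way caps the activation magnitude at roughly `silu(limit) * limit`, which can help numerical stability in low-precision inference.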
vllm_ascend/models/gpt_oss.py
Outdated
```python
def forward(
    self,
    hidden_states: torch.Tensor,
    attn_metadata: Optional[AttentionMetadata] = None,
) -> torch.Tensor:
```
vllm_ascend/models/gpt_oss.py
Outdated
```python
# MLP
hidden_states, residual = self.post_attention_layernorm(
    hidden_states, residual)
hidden_states = self.mlp(hidden_states, attn_metadata)
```
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Description
This PR adds support for OpenAI's GPT-OSS (Mixture-of-Experts) model on Ascend NPU through the vLLM Ascend framework, extending vLLM Ascend's model registry to cover the GPT-OSS architecture.
What this PR does
Technical Implementation
Core Model Components
- `GPTOSSConfig`: Model configuration class with MoE and attention parameters
- `GPTOSSAttention`: Sliding window attention implementation with RoPE support
- `GPTOSSMoELayer`: MoE layer using `AscendFusedMoE` for expert routing
- `GPTOSSDecoderLayer`: Complete transformer decoder layer
- `GPTOSSModel`: Main model class with embedding and layer stack
- `GPTOSSForCausalLM`: Causal language modeling wrapper

Files Added/Modified
Core Implementation
- `vllm_ascend/models/gpt_oss.py` - GPT-OSS model implementation
- `vllm_ascend/models/__init__.py` - Model registry updates

Supporting Tools and Examples
- `tools/convert_gpt_oss.py` - Model conversion utility
- `examples/gpt_oss_example.py` - Usage examples
- `scripts/gpt_oss_quickstart.sh` - Quick start script

Testing & Documentation
- `tests/ut/test_gpt_oss_model.py` - Unit tests
- `docs/source/models/gpt_oss.md` - Documentation
- `GPT_OSS_MIGRATION_README.md` - Migration guide

Usage Example