
Conversation

@wusar wusar commented Aug 18, 2025

Description

This PR adds support for OpenAI's GPT-OSS (Mixture-of-Experts) model on Ascend NPU through the vLLM Ascend framework, extending vLLM Ascend's model registry to cover the GPT-OSS architecture.

What this PR does

  • Adds GPT-OSS model implementation to vLLM Ascend model registry
  • Implements the MoE architecture (128 experts, top-4 routing) using AscendFusedMoE
  • Adds a sliding-window attention mechanism (128 tokens on even layers, global attention on odd layers); see the sketch after this list
  • Integrates YARN RoPE scaling compatible with the GPT-OSS configuration
  • Provides model conversion tools from HuggingFace format to vLLM-compatible format
  • Includes usage examples and documentation
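
A rough sketch of the alternating attention pattern and routing parameters (the constants and helper below are illustrative assumptions for this description, not the exact fields defined by GPTOSSConfig in this PR):

from typing import Optional

SLIDING_WINDOW = 128   # window size applied on alternating layers
NUM_EXPERTS = 128      # MoE expert count
TOP_K = 4              # experts routed per token

def sliding_window_for_layer(layer_idx: int) -> Optional[int]:
    """Even layers attend within a 128-token window; odd layers attend globally."""
    return SLIDING_WINDOW if layer_idx % 2 == 0 else None

assert sliding_window_for_layer(0) == SLIDING_WINDOW
assert sliding_window_for_layer(1) is None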

Technical Implementation

Core Model Components

  • GPTOSSConfig: Model configuration class with MoE and attention parameters
  • GPTOSSAttention: Sliding window attention implementation with RoPE support
  • GPTOSSMoELayer: MoE layer using AscendFusedMoE for expert routing
  • GPTOSSDecoderLayer: Complete transformer decoder layer (composition sketched after this list)
  • GPTOSSModel: Main model class with embedding and layer stack
  • GPTOSSForCausalLM: Causal language modeling wrapper
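
For orientation, a minimal sketch of how these pieces compose inside the decoder layer (submodules are stubbed with nn.Identity; this is an illustration, not the PR's actual code):

import torch
from torch import nn

class DecoderLayerSketch(nn.Module):
    """Illustrative composition of a GPT-OSS decoder layer.

    The real submodules (GPTOSSAttention, GPTOSSMoELayer, vLLM's RMSNorm)
    are stubbed with nn.Identity so only the structure is shown.
    """

    def __init__(self) -> None:
        super().__init__()
        self.input_layernorm = nn.Identity()           # stand-in for RMSNorm
        self.self_attn = nn.Identity()                 # stand-in for GPTOSSAttention
        self.post_attention_layernorm = nn.Identity()  # stand-in for RMSNorm
        self.mlp = nn.Identity()                       # stand-in for GPTOSSMoELayer

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Pre-norm residual block around attention.
        hidden_states = hidden_states + self.self_attn(self.input_layernorm(hidden_states))
        # Pre-norm residual block around the MoE MLP.
        return hidden_states + self.mlp(self.post_attention_layernorm(hidden_states))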

Files Added/Modified

Core Implementation

  • vllm_ascend/models/gpt_oss.py - GPT-OSS model implementation
  • vllm_ascend/models/__init__.py - Model registry updates

Supporting Tools and Examples

  • tools/convert_gpt_oss.py - Model conversion utility
  • examples/gpt_oss_example.py - Usage examples
  • scripts/gpt_oss_quickstart.sh - Quick start script

Testing & Documentation

  • tests/ut/test_gpt_oss_model.py - Unit tests
  • docs/source/models/gpt_oss.md - Documentation
  • GPT_OSS_MIGRATION_README.md - Migration guide

Usage Example

from vllm import LLM, SamplingParams

# Initialize GPT-OSS model on Ascend NPU
llm = LLM(
    model="./gpt-oss-20b-converted",
    device="ascend", 
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True
)

# Generate text and print the completions
prompts = ["Hello, how are you?"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

- vLLM version: v0.10.1.1
- vLLM main: https://github.com/vllm-project/vllm/commit/b00e69f8ca55f4a82847d39466f57ceb748324c1

@wusar wusar marked this pull request as draft August 18, 2025 08:38

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the GPT-OSS model on Ascend NPU, a significant enhancement for the vLLM Ascend framework. The implementation is comprehensive, covering the model configuration, attention mechanism with sliding window and YARN RoPE scaling, and a Mixture-of-Experts layer using AscendFusedMoE. The code is well-structured and aligns with vLLM's architectural patterns. My review has identified a few areas with unused parameters and attributes within the new model implementation. Addressing these will improve code clarity, reduce potential for bugs, and enhance maintainability.

Comment on lines 194 to 197
        # Sink attention weights for streaming attention
        self.sinks = nn.Parameter(
            torch.zeros(self.num_heads, dtype=torch.float32)
        )

high

The self.sinks parameter is initialized, presumably for streaming attention, but it is not used within the forward method. This results in dead code that should be removed to improve clarity and avoid unnecessary memory allocation.
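
For context, a minimal sketch of how per-head sink logits are typically folded into attention, following the public GPT-OSS reference implementation (this is background for the comment, not code from this PR):

import torch

def softmax_with_sinks(scores: torch.Tensor, sinks: torch.Tensor) -> torch.Tensor:
    # scores: [num_heads, q_len, kv_len]; sinks: [num_heads]
    sink_logits = sinks.view(-1, 1, 1).expand(-1, scores.shape[-2], 1)
    combined = torch.cat([scores, sink_logits], dim=-1)  # one extra logit per head
    probs = torch.softmax(combined, dim=-1)
    return probs[..., :-1]  # drop the sink column; rows then sum to less than 1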

Comment on lines 253 to 254
        # Custom swiglu activation with limit
        self.swiglu_limit = config.swiglu_limit

high

The self.swiglu_limit attribute is set from the configuration but is never used in the GPTOSSMoELayer. If this parameter is not required by the AscendFusedMoE layer or elsewhere, it should be removed to eliminate dead code.
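
For reference, the public GPT-OSS implementation applies a limit-clamped SwiGLU roughly as sketched below; the alpha and limit values follow that reference and are assumptions here, and this PR may instead rely on AscendFusedMoE to apply the activation:

import torch

def clamped_swiglu(gate: torch.Tensor, up: torch.Tensor,
                   alpha: float = 1.702, limit: float = 7.0) -> torch.Tensor:
    gate = gate.clamp(max=limit)          # cap the gating pre-activation
    up = up.clamp(min=-limit, max=limit)  # symmetric clamp on the linear branch
    return (gate * torch.sigmoid(alpha * gate)) * (up + 1)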

Comment on lines 256 to 260
    def forward(
        self,
        hidden_states: torch.Tensor,
        attn_metadata: Optional[AttentionMetadata] = None,
    ) -> torch.Tensor:

high

The attn_metadata parameter is included in the forward method's signature but is not used within the method's body. To improve code clarity and maintainability, this unused parameter should be removed.

    def forward(
        self,
        hidden_states: torch.Tensor,
    ) -> torch.Tensor:

        # MLP
        hidden_states, residual = self.post_attention_layernorm(
            hidden_states, residual)
        hidden_states = self.mlp(hidden_states, attn_metadata)

high

In light of removing the unused attn_metadata parameter from GPTOSSMoELayer.forward, this call site should be updated to no longer pass the attn_metadata argument.

        hidden_states = self.mlp(hidden_states)

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

github-actions bot commented Sep 8, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.
