Conversation

tscholak (Collaborator)

Overview

This PR adds a comprehensive converter for OpenAI's GPT-OSS models (gpt-oss-120b and gpt-oss-20b) to Fast-LLM.

Key Features

Heterogeneous Block Pattern Support

  • Uses PatternBlockSequenceConfig to handle alternating sliding_attention and full_attention layers
  • Parses the layer_types field from HuggingFace config to create the correct block pattern
  • Follows the same approach as the Apriel hybrid SSM model converter

Mixture of Experts (MoE)

  • Supports 128 experts with 4 active per token (gpt-oss-120b)
  • Properly converts expert weights between Fast-LLM and HuggingFace formats
  • Based on Mixtral converter architecture

Architecture-Specific Features

  • YARN RoPE scaling: Handles advanced rotary positional embeddings
  • Grouped multi-query attention: 8 KV heads for efficient inference
  • Attention biases: Unlike Mistral/Mixtral, GPT-OSS supports attention biases
  • Large vocabulary: Handles ~201k tokens (o200k_harmony tokenizer)
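
For orientation, here is a sketch of the HuggingFace config fields the converter reads. Field names are recalled from the HF GPT-OSS config and are not guaranteed verbatim; values mirror the figures quoted above, so treat this as illustrative rather than an actual config.json excerpt.

# Illustrative sketch of the HuggingFace config fields the converter consumes.
# Field names recalled from the HF GPT-OSS config; values mirror gpt-oss-120b as described above.
hf_config = {
    "architectures": ["GptOssForCausalLM"],
    "num_local_experts": 128,         # 128 experts...
    "num_experts_per_tok": 4,         # ...with 4 active per token
    "num_key_value_heads": 8,         # grouped multi-query attention
    "attention_bias": True,           # unlike Mistral/Mixtral
    "sliding_window": 128,            # window for sliding_attention blocks
    "layer_types": ["sliding_attention", "full_attention"] * 2,  # one entry per layer in the real config
    "rope_scaling": {"rope_type": "yarn"},  # plus factor, beta_fast, beta_slow, ...
    "vocab_size": 201088,             # ~201k, o200k_harmony tokenizer
}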

Design Decisions

Why Mixtral over Llama?

GPT-OSS is fundamentally a MoE model, making Mixtral the natural base:

  • Similar MoE weight structure (gate + experts)
  • Expert routing mechanisms align
  • However, GPT-OSS supports attention biases (unlike Mistral/Mixtral), so we override that behavior

Heterogeneous Blocks

GPT-OSS alternates between two attention types:

  • Sliding window attention: Limited context window (128 tokens)
  • Full attention: Complete context access

This required using PatternBlockSequenceConfig rather than FixedBlockSequenceConfig.
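
A minimal sketch of the import-side mapping (the real code lives in fast_llm/models/gpt/conversion/gpt_oss.py; names here are simplified for illustration):

# Map HuggingFace layer_types to the short block names used in the pattern.
layout_names = {
    "sliding_attention": "sliding",
    "full_attention": "full",
}

def import_block_pattern(hf_config: dict) -> list[str]:
    # Default to full attention when layer_types is absent (as suggested in review below).
    layer_types = hf_config.get("layer_types", ["full_attention"])
    return [layout_names[layer_type] for layer_type in layer_types]

# ["sliding", "full", "sliding", "full"] for a 4-layer alternating model.
print(import_block_pattern({"layer_types": ["sliding_attention", "full_attention"] * 2}))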

Implementation Details

Files Changed

  • fast_llm/models/gpt/conversion/gpt_oss.py: Main converter implementation
    • GptOssAttentionConverter: Handles YARN RoPE, biases, sliding windows
    • GptOssMLPConverter: MoE weight conversion
    • GptOssBlockConverter: Block variant support
    • GptOssDecoderConverter: Layer type parsing and pattern creation
  • fast_llm/models/gpt/conversion/config.py: Checkpoint format definition
  • fast_llm/models/gpt/conversion/auto.py: Auto-detection registry
  • fast_llm/models/gpt/config.py: Model config registration
  • tests/utils/model_configs.py: Comprehensive test configuration

Test Configuration

Added gpt_oss test config with:

  • 4 layers alternating between sliding and full attention
  • MoE with 4 experts (scaled down from 128 for testing)
  • YARN RoPE scaling
  • Attention biases enabled
  • Roundtrip conversion testing enabled
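
The decoder portion of that test config is shaped roughly as follows; key names are illustrative only (the authoritative version is in tests/utils/model_configs.py):

# Sketch only: key names are illustrative, not the exact Fast-LLM config schema.
decoder_config = {
    "type": "pattern",               # PatternBlockSequenceConfig rather than a fixed sequence
    "blocks": {
        "sliding": ...,              # sliding-window attention block (window_size=128)
        "full": ...,                 # full-attention block
    },
    "num_blocks": 4,
    "pattern": ["sliding", "full"],  # minimal cycle, repeated to fill num_blocks (per review below)
}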

Testing

The added test configuration covers:

  • ✅ Basic model operations
  • ✅ Checkpoint save/load
  • Roundtrip conversion (Fast-LLM ↔ HuggingFace)
  • ✅ Distributed training

Run with:

pytest tests/models/test_checkpoint.py::test_conversion -k gpt_oss
pytest tests/models/test_checkpoint.py::test_converted_round_trip -k gpt_oss

Usage

# Load from HuggingFace format
pretrained:
  format: gpt_oss
  path: /path/to/gpt-oss-120b

# Export to HuggingFace format
training:
  export:
    format: gpt_oss

🤖 Generated with Claude Code

This commit adds a comprehensive converter for OpenAI's GPT-OSS models
(gpt-oss-120b and gpt-oss-20b) to Fast-LLM.

Key features:
- Heterogeneous block pattern support using PatternBlockSequenceConfig
  for alternating sliding window and full attention layers
- Mixture of Experts (MoE) support with 128 experts, 4 active per token
- YARN RoPE scaling for positional embeddings
- Grouped multi-query attention (8 KV heads)
- Attention bias support (unlike Mistral/Mixtral)
- Handles ~201k vocab size (o200k_harmony tokenizer)

Implementation details:
- Based on Mixtral converter (MoE architecture) rather than Llama
- Parses layer_types field from HuggingFace config to create block patterns
- Supports both import (HF → Fast-LLM) and export (Fast-LLM → HF)
- Includes comprehensive test configuration for roundtrip conversion

Files changed:
- fast_llm/models/gpt/conversion/gpt_oss.py: Main converter implementation
- fast_llm/models/gpt/conversion/config.py: Checkpoint format definition
- fast_llm/models/gpt/conversion/auto.py: Auto-detection registry
- fast_llm/models/gpt/config.py: Model config registration
- tests/utils/model_configs.py: Test configuration with heterogeneous blocks

🤖 Generated with Claude Code

@jlamypoirier (Collaborator) left a comment:

Looks good but could be simplified quite a bit.

from fast_llm.utils import Assert, safe_merge_dicts


class GptOssAttentionConverter(MistralAttentionConverter):

jlamypoirier:

Wouldn't it be simpler to start from llama and repeat the sliding_window option?

Assert.incl(config.dense_layer.bias.enabled, (None, config.add_linear_biases))


class GptOssMLPConverter(LlamaMLPConverter):

jlamypoirier:

Select Mistral / Mixtral converter dynamically instead? (See Apriel converter)

@classmethod
def import_config(cls, config: dict) -> dict:
"""Import decoder config, handling heterogeneous layer types."""
layer_types = config.get("layer_types", [])

jlamypoirier:

config.get("layer_types", ["full_attention"])

- "full": Full attention block
"""

layout_names = {

jlamypoirier:

Only used in GptOssDecoderConverter, move?

"sliding_attention": "sliding",
"full_attention": "full",
}
reverse_layout_names = {v: k for k, v in layout_names.items()}

jlamypoirier:

Unused

]


class GptOssBlockConverter:

jlamypoirier:

Why not inherit from the Llama converter?

attention_config = cls.mixer_converter_class.import_config(config)

# For sliding attention, ensure window_size is set
if layer_type == "sliding_attention":

jlamypoirier:

Move to `GptOssDecoderConverter` to avoid the extra argument?

},
},
"num_blocks": 4,
"pattern": ["sliding", "full", "sliding", "full"],

jlamypoirier:

["sliding", "full"]

ModelTestingGroup.convert: ModelTestingGroupAction.normal,
ModelTestingGroup.generate: ModelTestingGroupAction.not_implemented,
ModelTestingGroup.megatron: ModelTestingGroupAction.not_implemented,
ModelTestingGroup.distributed: ModelTestingGroupAction.normal,

jlamypoirier:

Distributed is very costly and not strictly needed here; let's use unimportant.

},
compare_factor=2.0,
# Micro-sequence split not supported (due to MoE).
skip_tests=("ms",),

jlamypoirier:

Why this? It's not there for Mixtral.

tscholak and others added 10 commits October 14, 2025 21:06
This commit fixes several critical issues with the GPT-OSS model converter
and works around a Triton kernel bug on ARM64.

## Issues Fixed:

1. **Triton sparse_map_kernel bug on ARM64 (Triton 3.3.1+)**
   - The kernel produces incorrect sparse_rows indices on ARM64
   - Workaround: Disabled Triton kernel to use PyTorch fallback
   - File: fast_llm/functional/triton/sparse_copy.py:312

2. **GPT-OSS heterogeneous block export**
   - Fixed sliding_window conflict in safe_merge_dicts (128 vs None)
   - Extract sliding_window separately before merging block configs
   - Use pattern matching for cleaner code
   - File: fast_llm/models/gpt/conversion/gpt_oss.py:315-366

3. **YARN RoPE configuration**
   - Fixed scale_factor field name (was incorrectly "scale")
   - Handle optional attention_factor field properly
   - Avoid parent class parsing YARN config incorrectly
   - Replace getattr/hasattr with pattern matching
   - File: fast_llm/models/gpt/conversion/gpt_oss.py:38-91

## Tests Passing:
- test_checkpoint_and_eval[gpt_oss] ✅
- test_conversion[gpt_oss] ✅
- test_checkpoint_and_eval[mixtral] ✅
- test_conversion[mixtral] ✅
- GPT-OSS 20B model loads from HuggingFace ✅

## Test Artifacts:
- test_sparse_map_debug.py: Comprehensive test suite for sparse_map kernel
- test_gpt_oss_load.py: Validation script for loading GPT-OSS 20B

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…andler

The HuggingFace standard uses "architectures" (plural, as a list), but the
code was incorrectly using "architecture" (singular, as a string). This was
introduced in commit 8864b23 (Block-modular models refactor) and broke
loading checkpoints from the HuggingFace Hub.

Changes:
- Changed _export_config to use "architectures": [cls.architecture] instead
  of "architecture": cls.architecture
- Changed _import_config to assert config["architectures"] == [cls.architecture]
- Removed obsolete test_gpt_oss_load.py (superseded by test_gpt_oss_forward.py)

This fix allows loading real HuggingFace checkpoints from the Hub, not just
Fast-LLM's own exports.
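
A minimal sketch of the change (the handler class name is a hypothetical stand-in; only the two lines described above are the point):

class ExampleHuggingfaceCheckpointHandler:  # hypothetical stand-in for the real handler
    architecture: str

    @classmethod
    def _export_config(cls, config) -> dict:
        return {
            "architectures": [cls.architecture],  # was: "architecture": cls.architecture
            # ... remaining exported fields ...
        }

    @classmethod
    def _import_config(cls, config: dict) -> None:
        # Reject checkpoints exported for a different architecture.
        assert config["architectures"] == [cls.architecture]
        # ... remaining import logic ...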
Implemented all suggestions from @jlamypoirier's review:

1. Simplified GptOssAttentionConverter:
   - Now inherits from LlamaAttentionConverter (which already supports YARN)
   - Only adds attention_bias support
   - Removed duplicate YARN RoPE handling

2. Dynamic MLP converter selection (like Apriel):
   - Added _mlp_converter_classes dict mapping MLPConfig types
   - Dynamically selects LlamaMLPConverter or MixtralMLPConverter
   - Removed custom GptOssMLPConverter class

3. Improved code organization:
   - Added _get_layer_type method to GptOssDecoderConverter
   - Removed unused reverse_layout_names
   - Moved sliding_window logic to decoder converter

4. Simplified test configuration:
   - Changed pattern from ["sliding", "full", "sliding", "full"] to ["sliding", "full"]
   - Changed distributed testing group from normal to unimportant

5. Added default for layer_types:
   - config.get("layer_types", ["full_attention"])

The refactored code is cleaner, more maintainable, and follows the existing
patterns in the codebase (Llama, Mistral, Apriel).
- Add missing "factor" field to YARN rope_scaling export
- Fix import to correctly access nested rope_scaling fields for both llama3 and yarn

The HuggingFace transformers library requires the "factor" field in
rope_scaling for YARN RoPE. Previously we were only exporting
attention_factor, beta_fast, beta_slow, and original_max_position_embeddings.

Also fixed the import side to correctly access fields from the nested
rope_scaling dictionary instead of the top-level config dictionary.
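
A sketch of the export side under these assumptions (the Fast-LLM attribute names here, e.g. original_context_length, are illustrative; scale_factor and the optional attention_factor are as described above):

def export_yarn_rope_scaling(rotary_config) -> dict:
    # Build the HuggingFace rope_scaling dict for YARN; "factor" is required by transformers.
    rope_scaling = {
        "rope_type": "yarn",
        "factor": rotary_config.scale_factor,
        "beta_fast": rotary_config.beta_fast,
        "beta_slow": rotary_config.beta_slow,
        "original_max_position_embeddings": rotary_config.original_context_length,
    }
    # attention_factor is optional and may be None.
    if rotary_config.attention_factor is not None:
        rope_scaling["attention_factor"] = rotary_config.attention_factor
    return rope_scaling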

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Implement _find_minimal_repeating_pattern() to detect and compress
repeating patterns in HuggingFace layer_types during import.

HuggingFace GPT-OSS models export the full expanded pattern (e.g.,
["sliding_attention", "full_attention", "sliding_attention", "full_attention"])
to satisfy their validation requirement that len(layer_types) == num_hidden_layers.

When importing, we now detect the minimal repeating cycle (e.g.,
["sliding_attention", "full_attention"]) to enable compact internal
representation as a PatternBlockSequenceConfig with pattern=["sliding", "full"].

This ensures proper round-trip conversion while satisfying both HuggingFace's
validation requirements and Fast-LLM's efficient pattern representation.
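
A minimal sketch of the cycle detection, mirroring what _find_minimal_repeating_pattern() does:

def find_minimal_repeating_pattern(layer_types: list[str]) -> list[str]:
    # Return the shortest prefix that, when repeated, reproduces the full list.
    for length in range(1, len(layer_types) + 1):
        if len(layer_types) % length == 0:
            candidate = layer_types[:length]
            if candidate * (len(layer_types) // length) == layer_types:
                return candidate
    return layer_types  # unreachable: the full list always reproduces itself

# The expanded HuggingFace pattern collapses back to the two-element cycle.
assert find_minimal_repeating_pattern(["sliding_attention", "full_attention"] * 12) == [
    "sliding_attention",
    "full_attention",
]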

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Fix YARN RoPE import: make attention_factor optional (defaults to None)
- Fix MLP bias import: make mlp_bias optional (defaults to False)
- Add GptOssMLPConverter to handle dequantized MoE format:
  - Router at .router (not .gate like Mixtral)
  - Concatenated gate_up_proj/down_proj (not w1/w2/w3 like Mixtral)
- Update test to dequantize MXFP4 weights before conversion

The GPT-OSS HuggingFace checkpoint uses MXFP4 quantization (uint8 blocks
and scales). The test now loads with HF's Mxfp4Config(dequantize=True) to
convert quantized weights to standard float format before Fast-LLM conversion.
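
A sketch of how the test loads the checkpoint with dequantization, assuming a transformers version that ships Mxfp4Config (the model id is illustrative):

from transformers import AutoModelForCausalLM, Mxfp4Config

# Dequantize the MXFP4 expert weights to standard floats on load, so the
# Fast-LLM converter sees ordinary tensors instead of uint8 blocks and scales.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=Mxfp4Config(dequantize=True),
)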

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
This is work-in-progress on adding support for MoE biases in the GPT-OSS
checkpoint converter and Fast-LLM MoE implementation.

Changes:
- Add sparse bias handling in linear.py for MoE expert biases
- Implement sparse bias gradient computation for 2D expert biases
- Add bias support to MoE MLP forward/backward in triton/mlp.py
- Manually create layer_2 bias with expert dimension in mixture_of_experts.py
- Add GptOssMoEBiasConverter for transposed bias format
- Add router bias support to GPT-OSS converter
- Add attention sinks support to GPT-OSS converter
- Update test configurations

Status:
- test_checkpoint_and_eval: PASSING ✓
- test_conversion: FAILING (parameter registration issue)
- test_converted_round_trip: SKIPPED

The conversion test fails because MoE layer_2 biases are not being properly
registered in _parameter_stages when loading from GPT-OSS format. The biases
are correctly saved to GPT-OSS checkpoint but not recreated on import.

Next steps:
- Fix parameter registration for manually created MoE biases
- Ensure biases are discovered during model initialization
- Complete round-trip conversion testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Implement elegant subclass-based solution for per-expert biases in MoE layers:
- Add MoEAffineLinearConfig subclass that overrides _get_weight_out_dim and _get_bias_dims
- Weight uses only output feature dimension, bias uses full (experts, features) structure
- Simplify MoE layer initialization by using composite dimensions directly
- Update GPT-OSS converter to use moe_affine_linear type for expert layers
- Remove unnecessary bias replacement code in MixtureOfExpertMLP

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The key insight is that layer 1 and layer 2 have different sparsity patterns:
- Layer 1 (output-parallel sparse): weight needs flattened size (num_experts * features)
- Layer 2 (input-parallel sparse): weight needs only feature dimension

Solution: Pass transposed_weight parameter to _get_weight_out_dim() to determine
which dimension to use:
- Non-transposed (layer 1): return full CompositeTensorDim with flattened size
- Transposed (layer 2): return last component (feature dimension only)

This allows both layers to use MoEAffineLinearConfig while generating correct
weight shapes for their respective sparse matmul operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
HuggingFace stores MoE expert weights with shape (num_experts, dim1, dim2)
while Fast-LLM expects flattened shape (num_experts * dim1, dim2).

Add GptOssMoEWeightConverter to handle the reshaping:
- Import: (num_experts, dim1, dim2) -> (num_experts * dim1, dim2)
- Export: (num_experts * dim1, dim2) -> (num_experts, dim1, dim2)

This allows checkpoint conversion to properly handle both gate_up_proj
and down_proj weights for MoE layers.
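
A minimal sketch of the two directions (shapes as described above; the real converter applies this to both gate_up_proj and down_proj):

import torch

def import_moe_weight(hf_weight: torch.Tensor) -> torch.Tensor:
    # HuggingFace (num_experts, dim1, dim2) -> Fast-LLM (num_experts * dim1, dim2)
    num_experts, dim1, dim2 = hf_weight.shape
    return hf_weight.reshape(num_experts * dim1, dim2)

def export_moe_weight(fast_llm_weight: torch.Tensor, num_experts: int) -> torch.Tensor:
    # Fast-LLM (num_experts * dim1, dim2) -> HuggingFace (num_experts, dim1, dim2)
    flattened_dim, dim2 = fast_llm_weight.shape
    return fast_llm_weight.reshape(num_experts, flattened_dim // num_experts, dim2)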

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>