Add GPT-OSS converter with heterogeneous block pattern support #374
base: main
Conversation
This commit adds a comprehensive converter for OpenAI's GPT-OSS models (gpt-oss-120b and gpt-oss-20b) to Fast-LLM.

Key features:
- Heterogeneous block pattern support using PatternBlockSequenceConfig for alternating sliding window and full attention layers
- Mixture of Experts (MoE) support with 128 experts, 4 active per token
- YARN RoPE scaling for positional embeddings
- Grouped multi-query attention (8 KV heads)
- Attention bias support (unlike Mistral/Mixtral)
- Handles ~201k vocab size (o200k_harmony tokenizer)

Implementation details:
- Based on the Mixtral converter (MoE architecture) rather than Llama
- Parses the layer_types field from the HuggingFace config to create block patterns
- Supports both import (HF → Fast-LLM) and export (Fast-LLM → HF)
- Includes a comprehensive test configuration for roundtrip conversion

Files changed:
- fast_llm/models/gpt/conversion/gpt_oss.py: Main converter implementation
- fast_llm/models/gpt/conversion/config.py: Checkpoint format definition
- fast_llm/models/gpt/conversion/auto.py: Auto-detection registry
- fast_llm/models/gpt/config.py: Model config registration
- tests/utils/model_configs.py: Test configuration with heterogeneous blocks

🤖 Generated with Claude Code
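As a rough illustration of the layer_types parsing described above, here is a minimal sketch (not the converter's actual code): `layout_names` matches the mapping shown later in this review, while `layer_types_to_pattern` is a hypothetical helper.

```python
# Minimal sketch (not the converter's actual code) of collapsing the HuggingFace
# layer_types field into Fast-LLM block names. layout_names matches the mapping
# shown later in this review; layer_types_to_pattern is a hypothetical helper.
layout_names = {
    "sliding_attention": "sliding",
    "full_attention": "full",
}

def layer_types_to_pattern(hf_config: dict) -> list[str]:
    """Map per-layer attention types from a HF config to Fast-LLM block names."""
    layer_types = hf_config.get("layer_types", ["full_attention"])
    return [layout_names[layer_type] for layer_type in layer_types]

# A 4-layer GPT-OSS-style config alternating attention types:
print(layer_types_to_pattern({"layer_types": ["sliding_attention", "full_attention"] * 2}))
# -> ['sliding', 'full', 'sliding', 'full']
```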
Looks good but could be simplified quite a bit.
from fast_llm.utils import Assert, safe_merge_dicts

class GptOssAttentionConverter(MistralAttentionConverter):
Wouldn't it be simpler to start from llama and repeat the sliding_window option?
Assert.incl(config.dense_layer.bias.enabled, (None, config.add_linear_biases))

class GptOssMLPConverter(LlamaMLPConverter):
Select Mistral / Mixtral converter dynamically instead? (See Apriel converter)
@classmethod
def import_config(cls, config: dict) -> dict:
    """Import decoder config, handling heterogeneous layer types."""
    layer_types = config.get("layer_types", [])
config.get("layer_types", ["full_attention"])
- "full": Full attention block | ||
""" | ||
|
||
layout_names = { |
Only used in GptOssDecoderConverter, move?
"sliding_attention": "sliding", | ||
"full_attention": "full", | ||
} | ||
reverse_layout_names = {v: k for k, v in layout_names.items()} |
Unused
]

class GptOssBlockConverter:
Why not inheriting from Llama converter?
attention_config = cls.mixer_converter_class.import_config(config)

# For sliding attention, ensure window_size is set
if layer_type == "sliding_attention":
Move to `GptOssDecoderConverter` to avoid the extra argument?
tests/utils/model_configs.py
},
},
"num_blocks": 4,
"pattern": ["sliding", "full", "sliding", "full"],
["sliding", "full"]
tests/utils/model_configs.py
ModelTestingGroup.convert: ModelTestingGroupAction.normal,
ModelTestingGroup.generate: ModelTestingGroupAction.not_implemented,
ModelTestingGroup.megatron: ModelTestingGroupAction.not_implemented,
ModelTestingGroup.distributed: ModelTestingGroupAction.normal,
Distributed is very costly and not strictly needed here, let's use unimportant
},
compare_factor=2.0,
# Micro-sequence split not supported (due to MoE).
skip_tests=("ms",),
Why this? It's not there for Mixtral.
This commit fixes several critical issues with the GPT-OSS model converter and works around a Triton kernel bug on ARM64.

## Issues Fixed:

1. **Triton sparse_map_kernel bug on ARM64 (Triton 3.3.1+)**
   - The kernel produces incorrect sparse_rows indices on ARM64
   - Workaround: Disabled Triton kernel to use PyTorch fallback
   - File: fast_llm/functional/triton/sparse_copy.py:312

2. **GPT-OSS heterogeneous block export**
   - Fixed sliding_window conflict in safe_merge_dicts (128 vs None)
   - Extract sliding_window separately before merging block configs
   - Use pattern matching for cleaner code
   - File: fast_llm/models/gpt/conversion/gpt_oss.py:315-366

3. **YARN RoPE configuration**
   - Fixed scale_factor field name (was incorrectly "scale")
   - Handle optional attention_factor field properly
   - Avoid parent class parsing YARN config incorrectly
   - Replace getattr/hasattr with pattern matching
   - File: fast_llm/models/gpt/conversion/gpt_oss.py:38-91

## Tests Passing:
- test_checkpoint_and_eval[gpt_oss] ✅
- test_conversion[gpt_oss] ✅
- test_checkpoint_and_eval[mixtral] ✅
- test_conversion[mixtral] ✅
- GPT-OSS 20B model loads from HuggingFace ✅

## Test Artifacts:
- test_sparse_map_debug.py: Comprehensive test suite for sparse_map kernel
- test_gpt_oss_load.py: Validation script for loading GPT-OSS 20B

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
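The sliding_window part of the fix boils down to removing the one key that legitimately differs across block types before the shared fields are merged. A minimal sketch of that idea is below; the helper is hypothetical, and the actual code in gpt_oss.py merges the stripped configs with Fast-LLM's safe_merge_dicts afterwards.

```python
# Sketch of the sliding_window workaround described in item 2 above. The helper
# below is hypothetical; the actual fix lives in gpt_oss.py.
def split_sliding_window(block_configs: list[dict]) -> tuple[list[dict], int | None]:
    """Remove sliding_window from each block config before merging shared fields."""
    stripped = []
    window_size = None
    for block in block_configs:
        block = dict(block)  # copy so the caller's config is not mutated
        window = block.pop("sliding_window", None)
        if window is not None:
            window_size = window  # sliding blocks all share one value (e.g. 128)
        stripped.append(block)
    return stripped, window_size
```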
…andler

The HuggingFace standard uses "architectures" (plural, as a list), but the code was incorrectly using "architecture" (singular, as a string). This was introduced in commit 8864b23 (Block-modular models refactor) and broke loading checkpoints from the HuggingFace Hub.

Changes:
- Changed _export_config to use "architectures": [cls.architecture] instead of "architecture": cls.architecture
- Changed _import_config to assert config["architectures"] == [cls.architecture]
- Removed obsolete test_gpt_oss_load.py (superseded by test_gpt_oss_forward.py)

This fix allows loading real HuggingFace checkpoints from the Hub, not just Fast-LLM's own exports.
Implemented all suggestions from @jlamypoirier's review:

1. Simplified GptOssAttentionConverter:
   - Now inherits from LlamaAttentionConverter (which already supports YARN)
   - Only adds attention_bias support
   - Removed duplicate YARN RoPE handling

2. Dynamic MLP converter selection (like Apriel):
   - Added _mlp_converter_classes dict mapping MLPConfig types
   - Dynamically selects LlamaMLPConverter or MixtralMLPConverter
   - Removed custom GptOssMLPConverter class

3. Improved code organization:
   - Added _get_layer_type method to GptOssDecoderConverter
   - Removed unused reverse_layout_names
   - Moved sliding_window logic to decoder converter

4. Simplified test configuration:
   - Changed pattern from ["sliding", "full", "sliding", "full"] to ["sliding", "full"]
   - Changed distributed testing group from normal to unimportant

5. Added default for layer_types:
   - config.get("layer_types", ["full_attention"])

The refactored code is cleaner, more maintainable, and follows the existing patterns in the codebase (Llama, Mistral, Apriel).
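Item 2 above is a type-keyed dispatch; a minimal sketch of that pattern is below. The converter names come from the commit, but the config classes and lookup helper are illustrative stand-ins, not Fast-LLM's actual classes.

```python
# Sketch of type-based converter dispatch (item 2 above). The converter names are
# taken from the commit; the config classes and lookup helper are stand-ins only.
class MLPConfig: ...                 # stand-in for the dense MLP config
class MoEMLPConfig(MLPConfig): ...   # stand-in for the mixture-of-experts config

class LlamaMLPConverter: ...
class MixtralMLPConverter: ...

_mlp_converter_classes = {
    MLPConfig: LlamaMLPConverter,
    MoEMLPConfig: MixtralMLPConverter,
}

def get_mlp_converter(mlp_config) -> type:
    """Pick the MLP converter class based on the exact config type."""
    try:
        return _mlp_converter_classes[type(mlp_config)]
    except KeyError:
        raise ValueError(f"No MLP converter registered for {type(mlp_config).__name__}")
```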
- Add missing "factor" field to YARN rope_scaling export
- Fix import to correctly access nested rope_scaling fields for both llama3 and yarn

The HuggingFace transformers library requires the "factor" field in rope_scaling for YARN RoPE. Previously we were only exporting attention_factor, beta_fast, beta_slow, and original_max_position_embeddings.

Also fixed the import side to correctly access fields from the nested rope_scaling dictionary instead of the top-level config dictionary.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
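For reference, the exported rope_scaling section now looks roughly like the sketch below. The field names follow the commit text; the rope_type key and all numeric values are placeholders/assumptions, not GPT-OSS's real settings.

```python
# Rough shape of the exported rope_scaling section after this fix. Field names come
# from the commit text; the rope_type key and numeric values are placeholders.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 32.0,                            # previously missing
    "attention_factor": None,                  # optional, may be absent
    "beta_fast": 32.0,
    "beta_slow": 1.0,
    "original_max_position_embeddings": 4096,
}
```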
Implement _find_minimal_repeating_pattern() to detect and compress repeating patterns in HuggingFace layer_types during import.

HuggingFace GPT-OSS models export the full expanded pattern (e.g., ["sliding_attention", "full_attention", "sliding_attention", "full_attention"]) to satisfy their validation requirement that len(layer_types) == num_hidden_layers. When importing, we now detect the minimal repeating cycle (e.g., ["sliding_attention", "full_attention"]) to enable compact internal representation as a PatternBlockSequenceConfig with pattern=["sliding", "full"].

This ensures proper round-trip conversion while satisfying both HuggingFace's validation requirements and Fast-LLM's efficient pattern representation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
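A minimal sketch of the cycle-detection idea (the actual _find_minimal_repeating_pattern() may differ in details):

```python
# Illustrative implementation of the minimal-repeating-cycle idea described above;
# the real _find_minimal_repeating_pattern() in the converter may differ in details.
def find_minimal_repeating_pattern(layer_types: list[str]) -> list[str]:
    """Return the shortest prefix whose repetition reproduces layer_types exactly."""
    n = len(layer_types)
    for length in range(1, n + 1):
        if n % length == 0 and layer_types[:length] * (n // length) == layer_types:
            return layer_types[:length]
    return layer_types  # only reached for empty input

assert find_minimal_repeating_pattern(
    ["sliding_attention", "full_attention"] * 12
) == ["sliding_attention", "full_attention"]
```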
- Fix YARN RoPE import: make attention_factor optional (defaults to None)
- Fix MLP bias import: make mlp_bias optional (defaults to False)
- Add GptOssMLPConverter to handle dequantized MoE format:
  - Router at .router (not .gate like Mixtral)
  - Concatenated gate_up_proj/down_proj (not w1/w2/w3 like Mixtral)
- Update test to dequantize MXFP4 weights before conversion

The GPT-OSS HuggingFace checkpoint uses MXFP4 quantization (uint8 blocks and scales). The test now loads with HF's Mxfp4Config(dequantize=True) to convert quantized weights to standard float format before Fast-LLM conversion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
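The dequantized load mentioned above looks roughly like this, assuming a transformers release that ships Mxfp4Config (as the test relies on) and the public openai/gpt-oss-20b checkpoint:

```python
# Sketch of the dequantized load described above; assumes a transformers release
# that ships Mxfp4Config and the public 20B checkpoint. Adjust dtype/device as needed.
from transformers import AutoModelForCausalLM, Mxfp4Config

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=Mxfp4Config(dequantize=True),  # expand MXFP4 weights to floats
    torch_dtype="bfloat16",
)
```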
This is work-in-progress on adding support for MoE biases in the GPT-OSS checkpoint converter and Fast-LLM MoE implementation.

Changes:
- Add sparse bias handling in linear.py for MoE expert biases
- Implement sparse bias gradient computation for 2D expert biases
- Add bias support to MoE MLP forward/backward in triton/mlp.py
- Manually create layer_2 bias with expert dimension in mixture_of_experts.py
- Add GptOssMoEBiasConverter for transposed bias format
- Add router bias support to GPT-OSS converter
- Add attention sinks support to GPT-OSS converter
- Update test configurations

Status:
- test_checkpoint_and_eval: PASSING ✓
- test_conversion: FAILING (parameter registration issue)
- test_converted_round_trip: SKIPPED

The conversion test fails because MoE layer_2 biases are not being properly registered in _parameter_stages when loading from GPT-OSS format. The biases are correctly saved to the GPT-OSS checkpoint but not recreated on import.

Next steps:
- Fix parameter registration for manually created MoE biases
- Ensure biases are discovered during model initialization
- Complete round-trip conversion testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Implement an elegant subclass-based solution for per-expert biases in MoE layers:
- Add MoEAffineLinearConfig subclass that overrides _get_weight_out_dim and _get_bias_dims
- Weight uses only the output feature dimension, bias uses the full (experts, features) structure
- Simplify MoE layer initialization by using composite dimensions directly
- Update GPT-OSS converter to use moe_affine_linear type for expert layers
- Remove unnecessary bias replacement code in MixtureOfExpertMLP

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The key insight is that layer 1 and layer 2 have different sparsity patterns:
- Layer 1 (output-parallel sparse): weight needs flattened size (num_experts * features)
- Layer 2 (input-parallel sparse): weight needs only the feature dimension

Solution: Pass a transposed_weight parameter to _get_weight_out_dim() to determine which dimension to use:
- Non-transposed (layer 1): return full CompositeTensorDim with flattened size
- Transposed (layer 2): return last component (feature dimension only)

This allows both layers to use MoEAffineLinearConfig while generating correct weight shapes for their respective sparse matmul operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
HuggingFace stores MoE expert weights with shape (num_experts, dim1, dim2) while Fast-LLM expects the flattened shape (num_experts * dim1, dim2). Add GptOssMoEWeightConverter to handle the reshaping:
- Import: (num_experts, dim1, dim2) -> (num_experts * dim1, dim2)
- Export: (num_experts * dim1, dim2) -> (num_experts, dim1, dim2)

This allows checkpoint conversion to properly handle both gate_up_proj and down_proj weights for MoE layers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
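The reshape itself is straightforward; a minimal sketch with hypothetical helper names is below (GptOssMoEWeightConverter does this inside Fast-LLM's conversion machinery).

```python
# Minimal sketch of the reshape this converter performs; the helper names here are
# hypothetical, GptOssMoEWeightConverter operates on checkpoint tensors in Fast-LLM.
import torch

def import_expert_weight(hf_weight: torch.Tensor) -> torch.Tensor:
    """(num_experts, dim1, dim2) -> (num_experts * dim1, dim2)"""
    num_experts, dim1, dim2 = hf_weight.shape
    return hf_weight.reshape(num_experts * dim1, dim2)

def export_expert_weight(fast_llm_weight: torch.Tensor, num_experts: int) -> torch.Tensor:
    """(num_experts * dim1, dim2) -> (num_experts, dim1, dim2)"""
    flat_dim, dim2 = fast_llm_weight.shape
    return fast_llm_weight.reshape(num_experts, flat_dim // num_experts, dim2)

# Round-trip check on a dummy tensor.
w = torch.randn(8, 4, 16)
assert torch.equal(export_expert_weight(import_expert_weight(w), num_experts=8), w)
```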
Overview
This PR adds a comprehensive converter for OpenAI's GPT-OSS models (gpt-oss-120b and gpt-oss-20b) to Fast-LLM.
Key Features
Heterogeneous Block Pattern Support
- Uses `PatternBlockSequenceConfig` to handle alternating `sliding_attention` and `full_attention` layers
- Parses the `layer_types` field from the HuggingFace config to create the correct block pattern

Mixture of Experts (MoE)
- 128 experts, 4 active per token

Architecture-Specific Features
- YARN RoPE scaling for positional embeddings
- Grouped multi-query attention (8 KV heads)
- Attention bias support (unlike Mistral/Mixtral)
- ~201k vocab size (o200k_harmony tokenizer)
Design Decisions
Why Mixtral over Llama?
GPT-OSS is fundamentally a MoE model, making Mixtral the natural base:
Heterogeneous Blocks
GPT-OSS alternates between two attention types:
- `sliding_attention`: sliding-window attention
- `full_attention`: full (global) attention
This required using `PatternBlockSequenceConfig` rather than `FixedBlockSequenceConfig`.

Implementation Details
Files Changed
- `fast_llm/models/gpt/conversion/gpt_oss.py`: Main converter implementation
  - `GptOssAttentionConverter`: Handles YARN RoPE, biases, sliding windows
  - `GptOssMLPConverter`: MoE weight conversion
  - `GptOssBlockConverter`: Block variant support
  - `GptOssDecoderConverter`: Layer type parsing and pattern creation
- `fast_llm/models/gpt/conversion/config.py`: Checkpoint format definition
- `fast_llm/models/gpt/conversion/auto.py`: Auto-detection registry
- `fast_llm/models/gpt/config.py`: Model config registration
- `tests/utils/model_configs.py`: Comprehensive test configuration

Test Configuration
Added `gpt_oss` test config with:

Testing
The converter includes test configuration for:
Run with:
Usage
References
🤖 Generated with Claude Code