
Add Step 3.5 Flash model support with MTP #94

Open
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feat/step3p5-support


@janhilgard

Summary

  • Step 3.5 Flash (196B MoE, 288 experts, top-8, ~11B active params) — full MLX-native model support with MTP (Multi-Token Prediction)
  • scripts/add_mtp_weights_step3p5.py — downloads BF16 MTP shards, extracts layers 45-47, remaps them to mtp.layers.* (sketched after this list), quantizes to 4-bit, and installs the MTP-enabled modeling file
  • scripts/modeling_step3p5_mtp.py — complete model implementation with Step3p5MTP, Step3p5MTPLayer, Step3p5SharedHead, mtp_forward, make_mtp_cache
  • Reasoning parser alias step3p5 (reuses deepseek_r1 <think> tag parser)
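The remap step is small enough to sketch. This is illustrative only; the function names, the regex, and the group size are assumptions, not the script's actual internals:

import re
import mlx.core as mx

def remap_key(key: str) -> str | None:
    # model.layers.{45,46,47}.* -> mtp.layers.{0,1,2}.*; other keys are dropped
    m = re.match(r"model\.layers\.(4[5-7])\.(.+)", key)
    return f"mtp.layers.{int(m.group(1)) - 45}.{m.group(2)}" if m else None

def quantize_4bit(w: mx.array):
    # mx.quantize returns (quantized weights, scales, biases)
    return mx.quantize(w, group_size=64, bits=4)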

Key differences from Qwen3-Next MTP (add_mtp_weights.py)

|                  | Qwen3-Next         | Step 3.5 Flash                                   |
| ---------------- | ------------------ | ------------------------------------------------ |
| MTP layers       | 1 (single shard)   | 3 (two shards, layers 45-47)                     |
| MTP architecture | MoE-based          | Dense MLP + per-layer shared_head                |
| Quantization     | 6-bit              | 4-bit                                            |
| Key remapping    | mtp.* passthrough  | model.layers.{45,46,47}.* → mtp.layers.{0,1,2}.* |
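For intuition, a dense-MLP MTP layer with a per-layer shared head (second row above) looks roughly like this in MLX. Class and parameter names are illustrative, not the actual Step3p5MTPLayer API:

import mlx.core as mx
import mlx.nn as nn

class MTPLayer(nn.Module):
    # Illustrative: dense MLP block plus a per-layer vocabulary head
    def __init__(self, dims: int, hidden: int, vocab: int):
        super().__init__()
        self.norm = nn.RMSNorm(dims)
        self.gate_proj = nn.Linear(dims, hidden, bias=False)
        self.up_proj = nn.Linear(dims, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dims, bias=False)
        self.shared_head = nn.Linear(dims, vocab, bias=False)

    def __call__(self, h: mx.array):
        x = self.norm(h)
        h = h + self.down_proj(nn.silu(self.gate_proj(x)) * self.up_proj(x))
        # Return the hidden state for the next MTP layer plus this layer's draft logits
        return h, self.shared_head(h)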

Usage

# 1. Download model and add MTP weights
python scripts/add_mtp_weights_step3p5.py

# 2. Serve with MTP enabled
vllm-mlx serve mlx-community/Step-3.5-Flash-4bit \
    --enable-mtp --port 1340 \
    --reasoning-parser step3p5
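Once the server is up, it can be queried like any OpenAI-compatible endpoint (assuming vllm-mlx follows the usual vLLM convention of serving /v1; the prompt is illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1340/v1", api_key="unused")
resp = client.chat.completions.create(
    model="mlx-community/Step-3.5-Flash-4bit",
    messages=[{"role": "user", "content": "What is multi-token prediction?"}],
)
print(resp.choices[0].message.content)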

Note on custom modeling file

The MLX community 4-bit model ships without MTP support in its modeling_step3p5.py. The script automatically installs an MTP-enabled version. This is a workaround until ml-explore/mlx-lm#901 is merged upstream.
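Conceptually, the install step just drops the MTP-enabled file into the local Hugging Face snapshot under the name the repo already uses. A hedged sketch; the real script may do this differently:

import shutil
from pathlib import Path
from huggingface_hub import snapshot_download

snap = Path(snapshot_download("mlx-community/Step-3.5-Flash-4bit"))
shutil.copy("scripts/modeling_step3p5_mtp.py", snap / "modeling_step3p5.py")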

Test plan

  • Run add_mtp_weights_step3p5.py on a fresh mlx-community/Step-3.5-Flash-4bit download
  • Verify MTP weights in model-mtp.safetensors and updated config.json (see the check after this list)
  • Start server with --enable-mtp and confirm MTP speculative decoding works
  • Verify --reasoning-parser step3p5 correctly extracts <think> tags
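A quick way to do the second check (the path assumes the shard sits in the current directory; adjust it to the local snapshot):

from safetensors import safe_open

with safe_open("model-mtp.safetensors", framework="np") as f:
    mtp_keys = [k for k in f.keys() if k.startswith("mtp.layers.")]
print(f"{len(mtp_keys)} MTP tensors, e.g. {mtp_keys[:3]}")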

🤖 Generated with Claude Code

Step 3.5 Flash is a 196B MoE model (288 experts, top-8 routing, ~11B
active params) with 3 MTP prediction layers. The MLX community 4-bit
conversion strips MTP weights and lacks MTP-aware modeling code.

This adds:
- scripts/add_mtp_weights_step3p5.py: Downloads BF16 MTP shards from
  the original model, extracts layers 45-47, remaps to mtp.layers.*,
  quantizes to 4-bit, and installs the MTP modeling file
- scripts/modeling_step3p5_mtp.py: Full MLX-native model implementation
  with MTP support (Step3p5MTP, Step3p5MTPLayer, Step3p5SharedHead)
- Reasoning parser alias "step3p5" (reuses deepseek_r1 <think> parser)
- Documentation updates in README.md and docs/reference/models.md

Note: The custom modeling file is a workaround until
ml-explore/mlx-lm#901 is merged upstream.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>