We present the first systematic framework for converting a trained Mixture-of-Experts (MoE) language model into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. Our key finding is that expert scoring is the dominant design choice, and our diversity-aware DO-ACP scoring with pure pruning outperforms dense-to-dense pruning by +6.3 pp average downstream accuracy after ~4B-token distillation, at 1.6× faster training wall-clock speed.
Primary target: Qwen3-30B-A3B (128 experts, top-8 routing, 30B total / 3B active) → ~3.3B dense student. The codebase also covers three additional families for cross-model validation:
- DeepSeek-V2-Lite (64 routed + 2 shared experts, top-6)
- GPT-OSS-20B (32 experts, top-4)
- EMO / StdMoE 1B-active-14B pair (127 routed + 1 shared experts, top-7) — a matched modularity-aware vs. standard-MoE pair, used to study compatibility with compression-aware pretraining
The pipeline proceeds in three stages (Section 3 of the paper):
-
Score and select
- Score all experts per layer from teacher routing statistics on a small calibration set.
- Retain the top-
Kexperts by importance.
-
Group, merge, and concatenate
- Assign the top-
Kexperts tokgroups — copied directly whenK=k(pure pruning), or merged by score-weighted averaging whenK>k. - Concatenate the
kgroup representatives into a single dense FFN, with down-projection scaling to approximate the average routing behavior.
- Assign the top-
-
Distill
- Recover quality lost during compression by distilling the dense student from the original MoE teacher with forward-KL on logits.
The design space is parameterized along three axes. The code retains internal identifiers from development; the table below maps each to the term used in the paper.
| Axis | Paper term | Code identifier |
|---|---|---|
| Scoring | SF — selection frequency | selection_freq |
| PP — pre-selection probability | preselect_prob |
|
| PS — post-selection probability | postselect_prob |
|
| CP — conditional probability | conditional_prob |
|
| ACP — activation-weighted conditional probability | reap |
|
| DO-CP — D-optimal selection on CP | gram_logdet |
|
| DO-ACP — D-optimal selection on ACP (best) | gld_reap |
|
| Grouping | RR — round-robin | round_robin |
| WC — weight clustering | weight_cluster |
|
| RC — router clustering | router_cluster |
|
| AB — anchor-based | dominant_expert |
|
| OC — output clustering | output_cluster |
|
| DP scaling | uniform (1/k) |
uniform |
| proportional (by selected-expert importance) | proportional |
The D-optimal scorings (reap, gram_logdet, gld_reap) use a submodular log-determinant criterion on the expert-output Gram matrix to select a diverse subset; they require the Gram matrix and are produced by the full-grid pipeline (see Reproducing the paper). The four base scorings run directly through run_merging_ablation.py.
The distillation trainer is built on NVIDIA Model Optimizer's modelopt.torch.distill plugins, installed automatically as a dependency.
# clone repository
git clone https://github.com/krafton-ai/moe-to-dense
cd moe-to-dense
# install project (nvidia-modelopt, torch, transformers, datasets, trl)
pip install -e .
# optional extras
pip install -e ".[train]" # adds deepspeed
pip install -e ".[eval]" # adds vllm + lm-evalRequires Python ≥3.10. Distillation uses DeepSpeed ZeRO-2 by default; configs live in configs/. Dependency windows are pinned in pyproject.toml (notably nvidia-modelopt==0.37.0 and transformers>=4.51,<4.58); see the comments there for the rationale.
.
├── modules/ # Core library
│ ├── expert_importance.py # Routing-stat collection
│ ├── expert_merger.py # Scoring × grouping × scaling → dense FFN
│ ├── gram_logdet.py # D-optimal (log-det) selection + ACP scores
│ ├── dense_pruner.py # Dense-to-dense (D2D) pruning baseline
│ ├── distillation_trainer.py # Forward-KL KD trainer wrapper
│ ├── evaluation.py # Benchmark + WikiText-2 PPL eval
│ ├── checkpoint_manager.py
│ └── model_adapters.py # Qwen3 / DeepSeek-V2 / GPT-OSS / EMO adapters
├── scripts/ # Entry points (see below)
├── configs/ # DeepSpeed + training configs
└── pyproject.toml
The three canonical stages on a single configuration. Replace model IDs / paths as needed.
python scripts/run_merging_ablation.py \
--grid-sweep \
--model Qwen/Qwen3-30B-A3B \
--scoring conditional_prob \
--grouping round_robin \
--top-k 8This collects routing stats once and merges each (scoring, grouping, K) config into ./checkpoints/merging_ablation/<config>/, reporting pre-distill WikiText-2 perplexity for each. To build the paper's best student (DO-ACP, gld_reap), use the full-grid pipeline below — it collects the expert-output Gram matrix the D-optimal scorings need.
deepspeed --num_gpus=4 scripts/run_distillation.py \
--teacher Qwen/Qwen3-30B-A3B \
--student ./checkpoints/merging_ablation/<config>/ \
--output-dir ./checkpoints/distilled/<config>/ \
--deepspeed configs/ds_zero2_config.json \
--max-steps 3000 \
--warmup-steps 100 --decay-steps 300 --scheduler-type wsd \
--learning-rate 1e-4 --min-lr 1e-5 \
--per-device-batch-size 4 --gradient-accumulation-steps 24 \
--temperature 1.0Forward KL on logits (no intermediate / hidden-state loss) is the default and the paper's best objective. Pass --teacher-top-k 16 to distill from expanded teacher routing (k'=2k, +0.70 pp).
# benchmark accuracy (vLLM backend; Qwen-family models)
python scripts/evaluate_benchmark_vllm.py \
--model ./checkpoints/distilled/<config>/final \
--tasks winogrande,hellaswag,arc_easy,arc_challenge,mmlu
# WikiText-2 perplexity
python scripts/eval_checkpoint_ppl.py \
./checkpoints/distilled/<config>/finalFor DeepSeek-V2-Lite and GPT-OSS-20B, use the HuggingFace backend (vLLM is not supported for those families in this codebase).
End-to-end pipeline scripts assemble the three stages above plus baselines.
| Script | What it runs |
|---|---|
scripts/run_full_grid_pipeline.sh |
350-config pre-distill PPL sweep (7 scoring × 5 grouping × 2 DP scaling × 5 K) → distill the 35 best scoring×grouping pairs (0.3B tokens) |
scripts/run_scaleup_pipeline.sh |
Extended ~4B-token distillation of the four headline configs (DO-ACP, SF, D2D, Random FFN) |
scripts/run_deepseek_pipeline.sh |
DeepSeek-V2-Lite cross-model validation |
scripts/run_gptoss_pipeline.sh |
GPT-OSS-20B cross-model validation |
scripts/run_emo_pipeline.sh · scripts/run_stdmoe_pipeline.sh |
DO-ACP pure-pruning distillation on the matched modularity-aware / standard-MoE pair (Table 11); eval with run_emo_eval.sh + run_emo_wt2_ppl.sh |
scripts/run_base_ppl_sweep.py + run_base_distill_pipeline.sh + run_base_k8_pipeline.sh |
Base-teacher A/B (pre-distill PPL sweep → distill top configs) |
scripts/run_all_evals.sh |
Parallel benchmark evals across checkpoints |
Pipeline scripts honor the DEEPSPEED env var (default deepspeed) and read the HuggingFace cache from $HF_HOME. Checkpoints land under ./checkpoints/ (gitignored); expect tens to hundreds of GB depending on which pipeline you run.
Two baselines provide context for the MoE-to-dense results:
- D2D pruning (
modules/dense_pruner.py): dense-to-dense structured pruning of a matched-parameter dense teacher (Qwen3-32B → ~3.4B), distilled with its dense teacher under the same token budget. - Random FFN + teacher attention: copies the teacher's attention layers and randomly initializes the dense FFN — a lower bound on what distillation alone recovers.
@article{kim2026pruning,
title={Pruning and Distilling Mixture-of-Experts into Dense Language Models},
author={Kim, Junhyuck and Yun, Jihun and Kim, Haechan and Kim, Gyeongman and Bae, Joonghyun and Cho, Jaewoong},
journal={arXiv preprint arXiv:2605.28207},
year={2026}
}Apache License 2.0 — see LICENSE.
This work builds on NVIDIA Model Optimizer, whose modelopt.torch.distill plugins power our knowledge-distillation trainer.