Ying Zhu1,2,3, Jiaxin Wan2, Tianyi Liang2,3, Xu Guo1,2, Xiaoran Liu1,2,3,
Zengfeng Huang1,2, Ziwei He2,3,†, Xipeng Qiu1,2,3,†
1Fudan University 2Shanghai Innovation Institute 3OpenMoss Team
†Corresponding authors
We introduce DiRL, an open-source training framework for Diffusion Language Models (DLLMs) with SFT and RL stages. Using this framework, we train DiRL-8B-Instruct, achieving state-of-the-art results at the 8B scale on mathematical reasoning benchmarks, even outperforming 32B models on most tasks.
- 🎯 Novel RL Algorithm: We propose DiPO (Discrete Diffusion Policy Optimization), an RL algorithm that optimizes DLLMs at the generation-step level. It provides an unbiased implementation with full consistency between the optimization objective and the training process, and integrates dynamic sampling from DAPO during rollout to filter out low-quality data.
- 🚀 Efficient Training & Inference: We support the Accelerate framework for distributed training and the LMDeploy inference engine for efficient rollout, and we integrate a Speed Reward mechanism to optimize inference speed at the training level (an illustrative sketch follows this feature list), enabling both faster training and generation without sacrificing quality.
- 🧠 SOTA Performance: We achieve state-of-the-art results at the 8B scale among both autoregressive (AR) models and diffusion language models (DLLMs) across multiple mathematical reasoning benchmarks. Specifically, we reach 83.05% on MATH500, 20.63% on AIME2024, and 20.83% on AIME2025, surpassing all 8B baselines and even outperforming the 32B Qwen2.5-32B-Instruct model on the AIME benchmarks.
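The exact form of the Speed Reward is not specified in this overview; the following sketch is only a hedged guess at how a rule-based correctness reward could be combined with a bonus for finishing a rollout in fewer denoising steps. All names and the weighting are hypothetical, not the released implementation.

```python
# Hypothetical shaped reward for DLLM rollouts (illustration only).
def shaped_reward(correct: bool, steps_used: int, max_steps: int,
                  speed_weight: float = 0.1) -> float:
    correctness = 1.0 if correct else 0.0
    # Bonus grows as the rollout finishes in fewer denoising steps.
    speed_bonus = speed_weight * (1.0 - steps_used / max_steps)
    # Only grant the bonus for correct answers so speed never trades off accuracy.
    return correctness + (speed_bonus if correct else 0.0)
```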
We develop and release an open-source post-training framework for DLLMs, and we use it to train DiRL-8B-Instruct from SDAR-8B-Chat in two stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, we adopt a random-masking strategy to construct the training data for model fine-tuning. In the RL stage, we design DiPO (Discrete Diffusion Policy Optimization), an RL algorithm that optimizes at the generation-step level. Our implementation is an unbiased realization of the RL objective, ensuring complete consistency between the optimization objective and the actual training process. Additionally, during the rollout phase, we adopt dynamic sampling from DAPO to filter out data whose advantage standard deviation is zero. Through this two-stage training pipeline, we obtain DiRL-8B-Instruct, a high-performance diffusion language model for mathematical reasoning.
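As a concrete illustration of the dynamic-sampling step, the sketch below drops prompt groups whose rollout rewards all coincide (zero standard deviation), since their group-relative advantages are zero and provide no learning signal. The data layout and function name are assumptions for illustration, not the released implementation.

```python
import numpy as np

def dynamic_sampling_filter(prompt_groups):
    """Keep only prompt groups whose rollout rewards are not all identical.

    `prompt_groups` is assumed to be a list of dicts, each holding the
    per-rollout rewards of one prompt (DAPO-style group sampling).
    """
    kept = []
    for group in prompt_groups:
        rewards = np.asarray(group["rewards"], dtype=np.float32)
        # Zero std means every rollout received the same reward, so the
        # group-normalized advantage is zero everywhere; drop the group.
        if rewards.std() > 0:
            kept.append(group)
    return kept
```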
DiRL-8B-Instruct achieves state-of-the-art results among DLLMs across mathematical reasoning benchmarks. Highlights include 83.05% on MATH500 (surpassing the base model by +11.20%), 20.63% on AIME2024 and 20.83% on AIME2025 (dramatically outperforming all baselines), and 46.40% on OlympiadBench. Our 8B model achieves performance comparable to or exceeding much larger 32B models on most benchmarks.
```bash
git clone https://github.com/OpenMOSS/DiRL.git
cd DiRL
pip install -r requirements.txt
```

If flash-attn installation fails, you can download the pre-built wheel file and install it manually:
```bash
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
```

Edit download.sh to set your Hugging Face token and username, then run:
```bash
bash download.sh
```

```python
from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig
from transformers import AutoTokenizer

if __name__ == '__main__':
    model_path = "OpenMOSS-Team/DiRL-8B-Instruct"

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Prepare prompts
    prompts = [
        [{"role": "user", "content": "Solve: If x + 5 = 12, what is x?"}],
    ]
    prompts = tokenizer.apply_chat_template(prompts, tokenize=False, add_generation_prompt=True)

    # Configure backend for DLLM inference
    backend_config = PytorchEngineConfig(
        dtype="float16",
        max_prefill_token_num=8192,
        cache_max_entry_count=0.8,
        dllm_block_length=4,
        dllm_denoising_steps=4,
        dllm_unmasking_strategy="low_confidence_dynamic",
        dllm_confidence_threshold=0.9,
    )

    # Create inference pipeline
    with pipeline(model_path, backend_config=backend_config) as pipe:
        gen_config = GenerationConfig(
            top_p=1.0,
            top_k=50,
            temperature=1.0,
            do_sample=False,  # greedy decoding
            max_new_tokens=8192,
        )
        outputs = pipe(prompts, gen_config=gen_config)

        for output in outputs:
            print(output.text)
```

To evaluate models on multiple benchmarks (MATH500, GSM8K, AIME2024, AIME2025, OlympiadBench):
```bash
bash examples/eval.sh
```

Step 1: Prepare Training Data
While the full DiRL-8B-Instruct training data is not yet released, we provide lightweight datasets for quick experimentation:
- Light-OpenR1Math-SFT: 2K SFT samples from OpenR1Math
- Light-MATH-RL: 4K RL samples from MATH
Tip: For initial experimentation, we recommend starting with max_new_tokens of 2K to reduce training time and resource requirements.
You can also create your own training datasets following the formats below:
SFT training data format:
```json
[
  {
    "prompt": "<|im_start|>user\n[question]<|im_end|>\n<|im_start|>assistant\n",
    "response": "[answer]<|im_end|><|endoftext|>"
  }
]
```

RL training data format:
```json
[
  {
    "question": "[question]",
    "ground_truth_answer": "[answer]"
  }
]
```
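As an example of producing these files, a minimal script along the following lines could convert raw question/solution/answer triples into both formats. The source-data layout and output file names are assumptions for illustration; point the training scripts at whatever paths you actually use.

```python
import json

# Hypothetical raw data: a question, a worked solution, and the final answer.
pairs = [{
    "question": "If x + 5 = 12, what is x?",
    "solution": "Subtract 5 from both sides, so x = 7. The answer is 7.",
    "answer": "7",
}]

# SFT format: chat-template prompt plus the worked solution as the response.
sft_data = [
    {
        "prompt": f"<|im_start|>user\n{p['question']}<|im_end|>\n<|im_start|>assistant\n",
        "response": f"{p['solution']}<|im_end|><|endoftext|>",
    }
    for p in pairs
]

# RL format: plain question plus the ground-truth final answer for reward checking.
rl_data = [
    {"question": p["question"], "ground_truth_answer": p["answer"]}
    for p in pairs
]

with open("my_sft_data.json", "w") as f:
    json.dump(sft_data, f, ensure_ascii=False, indent=2)
with open("my_rl_data.json", "w") as f:
    json.dump(rl_data, f, ensure_ascii=False, indent=2)
```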
Step 2: Two-Stage Training

Stage 1: SFT Training
Supervised fine-tuning with a random-masking strategy to adapt the base model to mathematical reasoning tasks.
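As a rough sketch of the general idea (this is not the released training code; the mask-token id, the per-sequence masking rate, and restricting the loss to masked response positions are assumptions):

```python
import torch
import torch.nn.functional as F

def random_masking_sft_step(model, input_ids, response_mask, mask_token_id):
    """Illustrative SFT step: mask a random subset of response tokens and
    train the model to recover them.

    input_ids:     (batch, seq_len) token ids of prompt + response
    response_mask: (batch, seq_len) 1 where a token belongs to the response
    """
    # Sample a masking rate per sequence, then mask response tokens at that rate.
    rate = torch.rand(input_ids.size(0), 1, device=input_ids.device)
    to_mask = (torch.rand_like(input_ids, dtype=torch.float) < rate) & response_mask.bool()

    noised = input_ids.masked_fill(to_mask, mask_token_id)
    logits = model(input_ids=noised).logits  # assumes an HF-style forward

    # Cross-entropy only on the positions that were masked.
    return F.cross_entropy(logits[to_mask], input_ids[to_mask])
```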
```bash
bash examples/sft.sh
```

Stage 2: RL Training
Reinforcement learning with the DiPO algorithm to optimize the model at the generation-step level.
```bash
bash examples/rl.sh
```

- Release Inference Engine and Training Framework
- Release DiRL Technical Report
- Release Training Data of DiRL-8B-Instruct
- Release Thinking Model
- Support More RL Algorithms
- More features are in progress
We would like to express our gratitude to the following works (SDAR, dllm-RL, lmdeploy) for providing important theoretical foundations and inspiration for DiRL.
Join our WeChat group to discuss DLLM training and related topics:
For issues or inquiries:
- Ying Zhu, Shanghai Innovation Institute ([email protected])
If you find our work helpful, please consider citing:
```bibtex
@misc{zhu2025dirl,
  title={DiRL: An Efficient Training Framework for Diffusion Language Models},
  author={Zhu, Ying and Wan, Jiaxin and Liang, Tianyi and Guo, Xu and Liu, Xiaoran and Huang, Zengfeng and He, Ziwei and Qiu, Xipeng},
  year={2025},
  institution={Fudan University, Shanghai Innovation Institute},
  url={https://github.com/OpenMOSS/DiRL}
}
```


