
DiRL

An Efficient Training Framework for Diffusion Language Models

Ying Zhu1,2,3, Jiaxin Wan2, Tianyi Liang2,3, Xu Guo1,2, Xiaoran Liu1,2,3,
Zengfeng Huang1,2, Ziwei He2,3,†, Xipeng Qiu1,2,3,†

1Fudan University    2Shanghai Innovation Institute    3OpenMoss Team

†Corresponding authors


Overview


🌟 TL;DR

We introduce DiRL, an open-source training framework for Diffusion Language Models (DLLMs) with SFT and RL stages. Using this framework, we train DiRL-8B-Instruct, achieving state-of-the-art results at the 8B scale on mathematical reasoning benchmarks, even outperforming 32B models on most tasks.

🌱 Highlights

  • 🎯 Novel RL Algorithm: We propose DiPO (Discrete Diffusion Policy Optimization), an RL algorithm that optimizes DLLMs at the generation-step level. It provides an unbiased implementation in which the optimization objective is fully consistent with the actual training process, and integrates DAPO-style dynamic sampling during rollout to filter out low-quality data.

  • 🚀 Efficient Training & Inference: We support the Accelerate framework for distributed training and the LMDeploy inference engine for efficient rollout, and integrate a Speed Reward mechanism that optimizes inference speed at the training level, enabling faster training and generation without sacrificing quality.

  • 🧠 SOTA Performance: We achieve state-of-the-art results at the 8B scale among both autoregressive (AR) models and diffusion language models (DLLMs) across multiple mathematical reasoning benchmarks. Specifically, we reach 83.05% on MATH500, 20.63% on AIME2024, and 20.83% on AIME2025, surpassing all 8B baselines and even outperforming Qwen2.5-32B-Instruct on the AIME benchmarks.

🧠 Method

We develop and release an open-source post-training framework for DLLMs and use it to train DiRL-8B-Instruct, based on SDAR-8B-Chat, in two stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, we adopt a random-masking strategy to construct the training data for model fine-tuning. In the RL stage, we design DiPO (Discrete Diffusion Policy Optimization), an RL algorithm that optimizes at the generation-step level; its implementation is unbiased, so the optimization objective is fully consistent with the actual training process. During the rollout phase, we additionally adopt dynamic sampling from DAPO to filter out data whose advantages have zero standard deviation. Through this two-stage pipeline, we train DiRL-8B-Instruct, a high-performance diffusion language model for mathematical reasoning.
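
To make the dynamic-sampling step concrete, the sketch below drops rollout groups whose rewards are all identical (zero advantage standard deviation) and computes group-normalized advantages in the GRPO style. It is a minimal illustration with made-up data structures, not the DiRL implementation.

# Illustrative DAPO-style dynamic sampling (not the DiRL implementation).
# Each prompt has a group of rollouts; if every rollout gets the same reward,
# the group advantages have zero standard deviation and carry no learning
# signal, so the prompt is dropped from the RL batch.
from statistics import mean, pstdev

def filter_rollout_groups(groups):
    """groups: list of dicts like {"prompt": str, "rewards": [float, ...]}."""
    kept = []
    for group in groups:
        rewards = group["rewards"]
        sigma = pstdev(rewards)
        if sigma == 0.0:  # identical rewards -> zero advantage std, skip
            continue
        mu = mean(rewards)
        # Group-normalized advantages (GRPO-style), one scalar per rollout.
        group["advantages"] = [(r - mu) / sigma for r in rewards]
        kept.append(group)
    return kept

# The second group is filtered out because all of its rewards are equal.
batch = [
    {"prompt": "Q1", "rewards": [1.0, 0.0, 1.0, 0.0]},
    {"prompt": "Q2", "rewards": [0.0, 0.0, 0.0, 0.0]},
]
print([g["prompt"] for g in filter_rollout_groups(batch)])  # ['Q1']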

📊 Performance

DiRL-8B-Instruct achieves state-of-the-art results among DLLMs across mathematical reasoning benchmarks. Highlights include 83.05% on MATH500 (surpassing the base model by +11.20%), 20.63% on AIME2024 and 20.83% on AIME2025 (dramatically outperforming all baselines), and 46.40% on OlympiadBench. Our 8B model achieves performance comparable to or exceeding much larger 32B models on most benchmarks.

Performance Comparison

🚀 Quick Start

Installation

git clone https://github.com/OpenMOSS/DiRL.git
cd DiRL
pip install -r requirements.txt

If flash-attn installation fails, you can download the pre-built wheel file and install it manually:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
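
As an optional sanity check (not part of the official setup), you can confirm that the manually installed wheel imports correctly:

# Optional sanity check for the manually installed flash-attn wheel.
import torch
import flash_attn

print("flash-attn version:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())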

Download Models and Datasets

Edit download.sh to set your Hugging Face token and username, then run:

bash download.sh
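
If you prefer not to use the script, the checkpoint can also be fetched directly with huggingface_hub. The sketch below only pulls the instruct model (its repository ID is taken from the inference example); see download.sh for the full list of model and dataset repositories.

# Minimal alternative to download.sh using huggingface_hub (model only).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="OpenMOSS-Team/DiRL-8B-Instruct",
    # token="hf_...",  # pass your Hugging Face token if the repo requires it
)
print("Model downloaded to:", local_dir)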

Inference

from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig
from transformers import AutoTokenizer

if __name__ == '__main__':
    model_path = "OpenMOSS-Team/DiRL-8B-Instruct"

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Prepare prompts
    prompts = [
        [{"role": "user", "content": "Solve: If x + 5 = 12, what is x?"}],
    ]
    prompts = tokenizer.apply_chat_template(prompts, tokenize=False, add_generation_prompt=True)

    # Configure backend for DLLM inference
    backend_config = PytorchEngineConfig(
        dtype="float16",
        max_prefill_token_num=8192,
        cache_max_entry_count=0.8,
        dllm_block_length=4,
        dllm_denoising_steps=4,
        dllm_unmasking_strategy="low_confidence_dynamic",
        dllm_confidence_threshold=0.9,
    )

    # Create inference pipeline
    with pipeline(model_path, backend_config=backend_config) as pipe:
        gen_config = GenerationConfig(
            top_p=1.0,
            top_k=50,
            temperature=1.0,
            do_sample=False,  # greedy decoding
            max_new_tokens=8192,
        )

        outputs = pipe(prompts, gen_config=gen_config)

        for output in outputs:
            print(output.text)

Evaluation

To evaluate models on multiple benchmarks (MATH500, GSM8K, AIME2024, AIME2025, OlympiadBench):

bash examples/eval.sh

Training

Step 1: Prepare Training Data

While the full DiRL-8B-Instruct training data is not yet released, we provide lightweight datasets for quick experimentation.

Tip: For initial experimentation, we recommend starting with max_new_tokens of 2K to reduce training time and resource requirements.

You can also create your own training datasets following the formats below:

SFT training data format:

[
  {
    "prompt": "<|im_start|>user\n[question]<|im_end|>\n<|im_start|>assistant\n",
    "response": "[answer]<|im_end|><|endoftext|>"
  }
]

RL training data format:

[
  {
    "question": "[question]",
    "ground_truth_answer": "[answer]"
  }
]
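
A small helper can produce both files from a list of question/answer pairs (a minimal sketch; the output file names are arbitrary, and the special tokens follow the SFT format shown above):

# Minimal sketch: build SFT- and RL-format JSON files from (question, answer) pairs.
import json

pairs = [
    ("If x + 5 = 12, what is x?", "x = 7"),
]

sft_data = [
    {
        "prompt": f"<|im_start|>user\n{q}<|im_end|>\n<|im_start|>assistant\n",
        "response": f"{a}<|im_end|><|endoftext|>",
    }
    for q, a in pairs
]
rl_data = [{"question": q, "ground_truth_answer": a} for q, a in pairs]

with open("sft_data.json", "w") as f:
    json.dump(sft_data, f, ensure_ascii=False, indent=2)
with open("rl_data.json", "w") as f:
    json.dump(rl_data, f, ensure_ascii=False, indent=2)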

Step 2: Two-Stage Training

Stage 1: SFT Training

Supervised fine-tuning with random-masking strategy to adapt the base model for mathematical reasoning tasks.

bash examples/sft.sh
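
Conceptually, the random-masking strategy replaces a random fraction of response tokens with a mask token and trains the model to recover only those positions. The sketch below illustrates the idea with hypothetical tensor shapes and mask-token handling; it is not the DiRL training code:

# Conceptual sketch of random masking for diffusion-style SFT (not the DiRL code).
import torch

def random_mask(response_ids: torch.Tensor, mask_token_id: int):
    """response_ids: (seq_len,) token ids of the assistant response."""
    # Sample a masking ratio, then mask each response token independently.
    ratio = torch.rand(1).item()
    mask = torch.rand(response_ids.shape) < ratio
    masked_ids = response_ids.clone()
    masked_ids[mask] = mask_token_id
    # Labels keep the original ids at masked positions and -100 (ignored) elsewhere,
    # so the loss is computed only on the tokens the model has to recover.
    labels = torch.where(mask, response_ids, torch.full_like(response_ids, -100))
    return masked_ids, labels

ids = torch.tensor([101, 205, 33, 78, 90])
masked_ids, labels = random_mask(ids, mask_token_id=0)
print(masked_ids, labels)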

Stage 2: RL Training

Reinforcement learning with DiPO algorithm to optimize the model at generation step level.

bash examples/rl.sh
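
Since the RL data format only carries a question and a ground-truth answer, the reward is rule-based. A minimal correctness reward might look like the sketch below (illustrative only; DiRL's actual reward design, including the Speed Reward mentioned in the highlights, is not reproduced here):

# Illustrative rule-based correctness reward for the RL stage (not the DiRL reward).
import re

def correctness_reward(response: str, ground_truth_answer: str) -> float:
    """Return 1.0 if the last \\boxed{...} answer matches the ground truth, else 0.0."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0
    return float(matches[-1].strip() == ground_truth_answer.strip())

print(correctness_reward("Thus the answer is \\boxed{7}.", "7"))  # prints 1.0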

📋 Roadmap

  • Release Inference Engine and Training Framework
  • Release DiRL Technical Report
  • Release Training Data of DiRL-8B-Instruct
  • Release Thinking Model
  • Support More RL Algorithms
  • More features are in progress

👏 Acknowledgement

We would like to express our gratitude to the following works (SDAR, dllm-RL, lmdeploy) for providing important theoretical foundations and inspiration for DiRL.

💬 Community

Join our WeChat group to discuss DLLM training and related topics:

WeChat QR Code

📧 Contact

For issues or inquiries:

📖 Citation

If you find our work helpful, please consider citing:

@misc{zhu2025dirl,
  title={DiRL: An Efficient Training Framework for Diffusion Language Models},
  author={Zhu, Ying and Wan, Jiaxin and Liang, Tianyi and Guo, Xu and Liu, Xiaoran and Huang, Zengfeng and He, Ziwei and Qiu, Xipeng},
  year={2025},
  institution={Fudan University, Shanghai Innovation Institute},
  url={https://github.com/OpenMOSS/DiRL}
}
