🚀 Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay

Official Implementation of "Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay" (NeurIPS 2025).

RL fine-tuning for LLMs is notoriously expensive 💸. We present two simple yet effective techniques to improve data efficiency:

(1) Difficulty-targeted Online daTa Selection (DOTS) -> fewer training steps to match original GRPO 🚀 (see the sketch below)

(2) Rollout Replay (RR) -> lower per-step compute ⚡️
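
For intuition, the sketch below shows one way difficulty-targeted selection can work, assuming each prompt has a predicted difficulty score in [0, 1] and that prompts of intermediate difficulty carry the most useful GRPO signal (rollout groups that are all correct or all wrong yield near-zero advantages). The function names and data are illustrative, not the repository's API:

# Illustrative sketch of difficulty-targeted selection (not the repo's API).
def select_batch(prompts, predicted_difficulty, batch_size, target=0.5):
    """Keep the prompts whose predicted difficulty is closest to `target`."""
    ranked = sorted(
        zip(prompts, predicted_difficulty),
        key=lambda pair: abs(pair[1] - target),
    )
    return [prompt for prompt, _ in ranked[:batch_size]]

# Toy usage: the three prompts with the most intermediate difficulty are kept.
prompts = [f"question_{i}" for i in range(8)]
difficulty = [0.05, 0.95, 0.45, 0.60, 0.10, 0.52, 0.99, 0.30]
print(select_batch(prompts, difficulty, batch_size=3))  # ['question_5', 'question_2', 'question_3']

In the full method, these difficulty estimates come from the adaptive difficulty prediction framework described below and are updated online as the policy improves.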

[Figure: framework overview]

Experiments on six LLM–dataset combinations show that our method reduces RL fine-tuning time by 25% to 65% while achieving the same performance as the original GRPO algorithm ⏳.

[Figure: main results]

⚙️ Installation

The library has been designed and tested with Python 3.10 and CUDA 12.1. First, ensure that CUDA 12.1 is installed, then run the following commands:

conda create --name rl_data python=3.10
conda activate rl_data

cd rl_training
pip install -e ./verl
pip install -r requirements.txt

🧠 Training the Adaptive Difficulty Prediction Framework

The core of our difficulty-targeted online data selection is the attention-based adaptive difficulty prediction framework. To achieve this efficiently, we freeze a backbone LLM (e.g., Qwen2.5-Math-1.5B-Instruct) and augment it with a lightweight adapter and a calibration head. The training process can be launched as follows (a conceptual sketch of the predictor is shown after the steps):

  1. Prepare the training data

    In adaptive_difficulty_prediction/load_data.py, replace data_train.pkl and data_ref.pkl with your customized datasets.
    You can refer to the example formats provided in the adaptive_difficulty_prediction/datasets/ directory.

  2. Launch embedding inference and training

    cd adaptive_difficulty_prediction
    bash run_bash/run_embed.sh
    bash run_bash/run_train.sh
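
As a rough illustration of the frozen-backbone-plus-adapter design described above, the PyTorch sketch below scores a question by attending from its frozen embedding over a reference set of questions with known (rollout-derived) difficulties, then calibrating the result to [0, 1]. Module names, sizes, and the exact attention formulation are assumptions made for illustration; the actual architecture lives in the adaptive_difficulty_prediction directory:

import torch
import torch.nn as nn

# Sketch: frozen backbone embeddings -> lightweight adapter ->
# attention over a reference set with known difficulties -> calibration head.
class AdaptiveDifficultyPredictor(nn.Module):
    def __init__(self, embed_dim: int, adapter_dim: int = 256):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(embed_dim, adapter_dim), nn.ReLU(),
            nn.Linear(adapter_dim, adapter_dim),
        )
        self.attn = nn.MultiheadAttention(adapter_dim, num_heads=4, batch_first=True)
        self.calibration_head = nn.Sequential(
            nn.Linear(adapter_dim + 1, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),  # difficulty in [0, 1]
        )

    def forward(self, query_emb, ref_emb, ref_difficulty):
        # query_emb: (B, D) frozen embeddings of the questions to score
        # ref_emb: (R, D) frozen embeddings of the reference questions
        # ref_difficulty: (R,) observed difficulties of the reference questions
        q = self.adapter(query_emb).unsqueeze(1)                        # (B, 1, A)
        kv = self.adapter(ref_emb).unsqueeze(0).expand(q.size(0), -1, -1).contiguous()
        pooled, attn_w = self.attn(q, kv, kv)                           # (B, 1, A), (B, 1, R)
        weighted_diff = attn_w.squeeze(1) @ ref_difficulty              # (B,)
        feats = torch.cat([pooled.squeeze(1), weighted_diff.unsqueeze(1)], dim=-1)
        return self.calibration_head(feats).squeeze(-1)                 # (B,) predicted difficulty

# Toy usage with random "embeddings".
model = AdaptiveDifficultyPredictor(embed_dim=1536)
pred = model(torch.randn(4, 1536), torch.randn(32, 1536), torch.rand(32))
print(pred.shape)  # torch.Size([4])

Because the backbone stays frozen, only the adapter, attention, and calibration head are trained, which keeps the predictor cheap to update during RL fine-tuning.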

โ™ป๏ธ Data-efficient RL Training with DOTS and RR

To launch RL training with DOTS and RR, run the following example script (Qwen2.5-Math-1.5B on DeepScaler):

cd rl_training
bash run_bash/final_ds_teacher_replay.sh

The adaptive_difficulty_prediction directory includes the open-sourced adaptive difficulty prediction model and the corresponding embeddings for the DeepScaler datasets.
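
Conceptually, Rollout Replay lowers per-step generation cost by reusing recent rollouts instead of regenerating every response from the current policy. The sketch below illustrates the idea with a simple bounded buffer; the names, the mixing ratio, and the omission of any off-policy correction are simplifications for illustration, not the repository's implementation:

import random
from collections import deque

# Conceptual rollout-replay buffer (illustrative only).
class RolloutReplayBuffer:
    def __init__(self, capacity: int = 4096):
        self.buffer = deque(maxlen=capacity)  # oldest rollouts are dropped when full

    def add(self, rollouts):
        self.buffer.extend(rollouts)

    def sample(self, n):
        return random.sample(list(self.buffer), min(n, len(self.buffer)))

def build_training_batch(generate_fn, prompts, buffer, replay_ratio=0.5):
    """Generate fresh rollouts for part of the batch and replay the rest."""
    n_replay = int(len(prompts) * replay_ratio)
    replayed = buffer.sample(n_replay)        # reused rollouts from earlier steps
    fresh = generate_fn(prompts[n_replay:])   # new rollouts from the current policy
    buffer.add(fresh)
    return fresh + replayed

Since response generation dominates per-step compute in RL fine-tuning, replaying a fraction of the rollouts directly reduces the per-step cost.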

๐Ÿ™ Acknowledgement

Part of our code is based on rllm and verl. We gratefully acknowledge their contributions.

📚 Citation

If you find our paper helpful, please consider citing it in your publication.

@article{sun2025improving,
  title={Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay},
  author={Sun, Yifan and Shen, Jingyan and Wang, Yibin and Chen, Tianyu and Wang, Zhendong and Zhou, Mingyuan and Zhang, Huan},
  journal={arXiv preprint arXiv:2506.05316},
  year={2025}
}
