FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models

Paper | Hugging Face | Dataset

We release FastCuRL-1.5B-Preview, a slow-thinking reasoning model that outperforms 📈 the previous SoTA DeepScaleR-1.5B-Preview while using only 🚀 50% of the training steps. We apply a novel curriculum-guided iterative lengthening reinforcement learning strategy to DeepSeek-R1-Distill-Qwen-1.5B and observe continuous performance improvements as training steps increase. To help reproduce our work and advance research progress, we open-source our code, model, and data.

The current version of the uploaded paper is a work in progress and will be updated. We aim to share further findings in due course.

Key Results

Training Details

| Model | Training Steps | Training Stages | GPUs Used in Each Stage |
| --- | --- | --- | --- |
| DeepScaleR-1.5B-Preview | ~1,750 | 3 | 8, 16, 32 |
| FastCuRL-1.5B-Preview | ~860 | 4 | 8, 8, 8, 8 |

Here, we normalize all step counts to a batch size of 128, so two steps with batch size 64 are counted as one step with batch size 128.
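As a quick illustration of this normalization, the sketch below rescales raw step counts by the ratio of the actual batch size to the reference batch size of 128 (the function name is ours, not from the released code):

def normalized_steps(raw_steps: int, batch_size: int, reference_batch_size: int = 128) -> float:
    # Convert raw optimizer steps taken at `batch_size` into equivalent steps
    # at the reference batch size used for reporting (128 here).
    return raw_steps * batch_size / reference_batch_size

# Example: 320 raw steps at batch size 64 count as 160 reported steps.
assert normalized_steps(320, 64) == 160.0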

We report Pass@1 accuracy averaged over 16 samples for each problem.

| Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | OlympiadBench | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| rStar-Math-7B | 26.7 | 78.4 | 47.5 | - | 47.1 | - |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | 39.7 | 43.3 | 50.9 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| Still-1.5B | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
| DeepScaleR-1.5B-Preview | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| FastCuRL-1.5B-Preview | 43.1 | 88.0 | 74.2 | 31.6 | 50.4 | 57.5 |
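For reference, the Pass@1 numbers above are obtained by sampling 16 responses per problem, scoring each response, and averaging the per-problem accuracies over the benchmark; the sketch below restates that computation (the helper names and the `is_correct` checker are placeholders, not the released evaluation code):

from typing import Callable, Dict, List

def avg_pass_at_1(problems: List[Dict], is_correct: Callable[[str, Dict], bool], n_samples: int = 16) -> float:
    # For each problem: fraction of its sampled responses judged correct (Pass@1 over n_samples).
    # The benchmark score is the mean of these per-problem fractions.
    per_problem = []
    for problem in problems:
        responses = problem["responses"][:n_samples]  # pre-generated samples for this problem
        correct = sum(is_correct(r, problem) for r in responses)
        per_problem.append(correct / n_samples)
    return sum(per_problem) / len(per_problem)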

Training Data

Following DeepScaleR, our training dataset consists of 40,315 unique problem-answer pairs compiled from:

  • AIME problems (1984-2023)
  • AMC problems (before 2023)
  • Omni-MATH dataset
  • Still dataset

In FastCuRL, we propose a simple condition-sensitive data segmentation approach, which splits the original dataset into three subsets.
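The exact segmentation condition is described in the paper; purely as an illustration of the idea, the sketch below splits the dataset into three subsets by input prompt length, with thresholds and field names that are our assumptions rather than the released configuration:

def segment_by_prompt_length(dataset, short_max_words=200, long_min_words=400):
    # Illustrative split of {'prompt': ..., 'answer': ...} examples into three
    # subsets by prompt length; the criterion and thresholds are assumptions.
    short, medium, long_ = [], [], []
    for example in dataset:
        n_words = len(example["prompt"].split())
        if n_words <= short_max_words:
            short.append(example)
        elif n_words < long_min_words:
            medium.append(example)
        else:
            long_.append(example)
    return short, medium, long_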

Segmented Datasets

Training Strategy

In FastCuRL, we propose a curriculum-guided iterative lengthening approach for improving the RL training efficiency of R1-like reasoning models. Specifically, the four stages are as follows:

  • Stage I (8K context, ~160 steps)
  • Stage II (16K context, ~590 steps)
  • Stage III (24K context, ~230 steps)
  • Stage IV (16K context, ~580 steps)

Overall, we find that across the whole training process, the steps chosen for stage transitions mainly occur toward the end of each stage, further highlighting the efficiency of the proposed FastCuRL approach.
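The schedule above can be written down as a small configuration; the sketch below is our own summary of the stage list (the token counts are the nominal 8K/16K/24K values, and the per-stage step counts are the approximate figures reported above):

# Curriculum schedule: maximum response context length and approximate steps per stage.
STAGE_SCHEDULE = [
    {"stage": 1, "max_response_length": 8192,  "steps": 160},
    {"stage": 2, "max_response_length": 16384, "steps": 590},
    {"stage": 3, "max_response_length": 24576, "steps": 230},
    {"stage": 4, "max_response_length": 16384, "steps": 580},
]

for cfg in STAGE_SCHEDULE:
    # Each stage corresponds to one launch script under ./scripts/train/ (see Training Scripts).
    print(f"Stage {cfg['stage']}: context {cfg['max_response_length']} tokens, ~{cfg['steps']} steps")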

Entropy Loss

We track how the entropy loss changes during training and observe differences between our training strategy and that of DeepScaleR.
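For reference, the entropy quantity logged during RL training is commonly the mean token-level entropy of the policy distribution over generated tokens; the sketch below restates that standard definition and is not necessarily the exact term computed in the training code:

import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    # logits: [batch, seq_len, vocab_size]; response_mask: [batch, seq_len],
    # with 1 for generated (response) tokens and 0 elsewhere.
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [batch, seq_len]
    return (token_entropy * response_mask).sum() / response_mask.sum()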

Training Scripts

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export VLLM_ATTENTION_BACKEND=XFORMERS

# Run 8K context length training, 160 steps
bash ./scripts/train/run_fastcurl_1.5b_8k_stage1.sh | tee -a fastcurl-1.5b-stage1.log

# Run 16K context length training, 590 steps
bash ./scripts/train/run_fastcurl_1.5b_16k_stage2.sh | tee -a fastcurl-1.5b-stage2.log

# Run 24K context length training, 230 steps
bash ./scripts/train/run_fastcurl_1.5b_24k_stage3.sh | tee -a fastcurl-1.5b-stage3.log

# Run 16K context length training, 580 steps
bash ./scripts/train/run_fastcurl_1.5b_16k_stage4.sh | tee -a fastcurl-1.5b-stage4.log

Evaluate

python3 -m verl.trainer.main_generation \
    trainer.nnodes=1 \
    trainer.n_gpus_per_node=8 \
    data.path=./fastcurl/data/test/xxx.parquet \
    data.output_path=${OUTPUT_DIR}/xxx.parquet \
    data.n_samples=16 \
    data.batch_size=2048 \
    model.path=${MODEL_PATH} \
    rollout.temperature=0.6 \
    rollout.response_length=32768 \
    rollout.top_k=-1 \
    rollout.top_p=1 \
    rollout.gpu_memory_utilization=0.9 \
    rollout.tensor_model_parallel_size=1

Citation

@misc{fastcurl,
      title={FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models}, 
      author={Mingyang Song and Mao Zheng and Zheng Li and Wenjie Yang and Xuan Luo and Yue Pan and Feng Zhang},
      year={2025},
      eprint={2503.17287},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.17287}, 
}

Acknowledgements

  • Our model is trained on top of DeepSeek-R1-Distill-Qwen-1.5B.
  • Our training experiments are powered by our heavily modified fork of verl.
  • We directly use DeepScaleR's code for our experiments, modifying only the parts affected by naming conflicts to avoid confusion.
