
Co-Reward Logo

Co-Reward: Self-Supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement

Paper License Stars

The current version of our paper is available via the 📄 Paper link.

Pipeline

Co-Reward is a self-supervised reinforcement learning method for LLM reasoning. It leverages contrastive agreement between each original question and a rephrased variant, so that the two views serve as reward signals for each other during training (a minimal sketch of the idea follows). This effectively mitigates the training collapse seen in existing self-rewarding reasoning methods such as Entropy Minimization, Intuitor, and Majority-Voting.
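The cross-view reward can be summarized in a few lines of Python. The sketch below is illustrative only, not the repository's actual trainer code: the pseudo_label helper and the exact-match scoring rule are assumptions made for clarity.

from collections import Counter

def pseudo_label(answers):
    # Majority-voted answer over a group of rollouts (assumed labeling rule).
    return Counter(answers).most_common(1)[0][0]

def co_rewards(orig_answers, rephrased_answers):
    # Each question is answered under both its original and rephrased form;
    # each view is then scored against the other view's pseudo-label, so the
    # two views supervise each other.
    orig_ref = pseudo_label(orig_answers)
    reph_ref = pseudo_label(rephrased_answers)
    r_orig = [1.0 if a == reph_ref else 0.0 for a in orig_answers]
    r_reph = [1.0 if a == orig_ref else 0.0 for a in rephrased_answers]
    return r_orig, r_reph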

Performance

Install Environment

# 1. create a clean conda environment
conda create -y -n coreward python=3.10
conda activate coreward

# 2. clone the repository
git clone https://github.com/tmlr-group/Co-Reward.git
cd Co-Reward

# 3. install external dependencies
cd coreward
bash scripts/install_env.sh

# 4. install Co-Reward in editable mode (dependencies were installed in step 3)
pip install -e . --no-deps
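After the install finishes, a quick import check can confirm the core packages resolved. The package names here are assumed from a typical RL-for-LLM training stack and are not pinned by this README:

python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"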

Training on MATH Dataset

Set WANDB_KEY in the coreward/math_co_reward.sh script to your own Weights & Biases API key, then run the following commands:

cd coreward
bash math_co_reward.sh
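If you prefer not to edit the script, you can also authenticate once via the wandb Python API. A minimal sketch follows; note that WANDB_API_KEY is wandb's standard environment variable, distinct from the script's WANDB_KEY placeholder:

import os
import wandb

# Placeholder key for illustration; never commit a real key.
os.environ["WANDB_API_KEY"] = "<your-wandb-api-key>"
wandb.login()  # reads WANDB_API_KEY from the environment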

Preprocess the Training Data

First, download the MATH dataset and prepare it using the following Python script:

python examples/data_preprocess/math_dataset.py

Second, rephrase the training set using Qwen3-32B as follows:

python rewrite_questions.py \
  --input_path data/math/train.parquet \
  --output_jsonl data/math/train_rewrite_Qwen3-32B.jsonl \
  --output_parquet data/math/train_rewrite_Qwen3-32B.parquet \
  --output_original_parquet data/math/train_original.parquet \
  --model_path $YOUR_Qwen3-32B_MODEL_PATH \
  --tokenizer_path $YOUR_Qwen3-32B_TOKENIZER_PATH \
  --question_column prompt \
  --batch_size 128
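Before training, it is worth spot-checking that the original and rephrased splits align row-for-row. A minimal sketch, assuming the question column is named prompt as passed via --question_column above:

import pandas as pd

orig = pd.read_parquet("data/math/train_original.parquet")
reph = pd.read_parquet("data/math/train_rewrite_Qwen3-32B.parquet")

assert len(orig) == len(reph)  # the two views should pair up one-to-one
print(orig["prompt"].iloc[0])  # original question
print(reph["prompt"].iloc[0])  # its rephrased counterpart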

Then, you can train your LLM with Co-Reward by following the training script above.

Dataset

We release our rephrased MATH training set on the Hugging Face Hub at TMLR-Group-HF/CoReward-RephrasedMATH.
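The released set can be pulled directly with the datasets library. A minimal sketch; the split name is an assumption, so check the dataset card if it differs:

from datasets import load_dataset

ds = load_dataset("TMLR-Group-HF/CoReward-RephrasedMATH", split="train")
print(ds)     # inspect the schema
print(ds[0])  # first example; field names follow the dataset card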

Checkpoints

We release all checkpoints trained by us, covering Co-Reward and every baseline; a loading sketch follows the tables below.

Checkpoints of Co-Reward

| Model Name | Model Size | Method | Hugging Face Link |
| --- | --- | --- | --- |
| TMLR-Group-HF/CoReward-Qwen2.5-3B | 3B | Co-Reward | View Model |
| TMLR-Group-HF/CoReward-Qwen2.5-7B | 7B | Co-Reward | View Model |
| TMLR-Group-HF/CoReward-Qwen3-1.7B-Base | 1.7B | Co-Reward | View Model |
| TMLR-Group-HF/CoReward-Qwen3-4B-Base | 4B | Co-Reward | View Model |
| TMLR-Group-HF/CoReward-Qwen3-8B-Base | 8B | Co-Reward | View Model |
| TMLR-Group-HF/CoReward-Llama-3.2-3B-Instruct | 3B | Co-Reward | View Model |

Checkpoints of Ground-Truth GRPO (GT-GRPO)

| Model Name | Model Size | Method | Hugging Face Link |
| --- | --- | --- | --- |
| TMLR-Group-HF/GT-Qwen2.5-3B | 3B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Qwen2.5-7B | 7B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Qwen3-1.7B-Base | 1.7B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Qwen3-4B-Base | 4B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Qwen3-8B-Base | 8B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Llama-3.2-3B-Instruct | 3B | GT-GRPO | View Model |

Checkpoints of Self-Certainty

| Model Name | Model Size | Method | Hugging Face Link |
| --- | --- | --- | --- |
| TMLR-Group-HF/Self-Certainty-Qwen2.5-3B | 3B | Self-Certainty | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen2.5-7B | 7B | Self-Certainty | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen3-1.7B-Base | 1.7B | Self-Certainty | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen3-4B-Base | 4B | Self-Certainty | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base | 8B | Self-Certainty | View Model |
| TMLR-Group-HF/Self-Certainty-Llama-3.2-3B-Instruct | 3B | Self-Certainty | View Model |

Checkpoints of Entropy Minimization

| Model Name | Model Size | Method | Hugging Face Link |
| --- | --- | --- | --- |
| TMLR-Group-HF/Entropy-Qwen2.5-3B | 3B | Entropy | View Model |
| TMLR-Group-HF/Entropy-Qwen2.5-7B | 7B | Entropy | View Model |
| TMLR-Group-HF/Entropy-Qwen3-1.7B-Base | 1.7B | Entropy | View Model |
| TMLR-Group-HF/Entropy-Qwen3-4B-Base | 4B | Entropy | View Model |
| TMLR-Group-HF/Entropy-Qwen3-8B-Base | 8B | Entropy | View Model |
| TMLR-Group-HF/Entropy-Llama-3.2-3B-Instruct | 3B | Entropy | View Model |

Checkpoints of Majority-Voting

| Model Name | Model Size | Method | Hugging Face Link |
| --- | --- | --- | --- |
| TMLR-Group-HF/Majority-Voting-Qwen2.5-3B | 3B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen2.5-7B | 7B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen3-1.7B-Base | 1.7B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen3-4B-Base | 4B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base | 8B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Llama-3.2-3B-Instruct | 3B | Majority-Voting | View Model |
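All checkpoints load with the standard transformers API. A minimal sketch using one of the models above; the prompt and generation settings are illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "TMLR-Group-HF/CoReward-Qwen2.5-3B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")

prompt = "What is 12 * 13? Show your reasoning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))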

TODO

This is an initial version of the code. We will make the following updates in the future.

  • [Models] Release all of our trained LLM checkpoints
  • [Code] Update the evaluation code
  • [Paper] Update the arXiv paper link
  • [Environment] Update the running environment file
  • [Readme] Update the README

📄 Citation

If you use our code or data, please cite our paper 📄!

@article{zhang2025coreward,
  title={Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement},
  author={Zizhuo Zhang and Jianing Zhu and Xinmu Ge and Zihua Zhao and Zhanke Zhou and Xuan Li and Xiao Feng and Jiangchao Yao and Bo Han},
  journal={arXiv preprint arXiv:2508.00410},
  year={2025}
}

Please give us a Star; thank you very much for your interest in our work!
