
Rethinking the Trust Region in LLM Reinforcement Learning

Penghui Qi*, Xiangxin Zhou*, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

Paper | GitHub

Overview

Figure 1: The plots show numerical differences between a training and an inference engine for Qwen3-30B-A3B-Base with identical parameters. (Left) The probability ratio (used in PPO) is highly volatile for low-probability tokens. (Right) In contrast, the TV divergence is much more stable. This highlights a key flaw of PPO's clipping mechanism: it over-penalizes low-probability tokens, which can slow learning, and under-penalizes high-probability tokens, which can permit large, destabilizing updates.
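As a rough illustration of the contrast in Figure 1, the snippet below computes, per sampled token, the PPO-style probability ratio and a simple TV-style discrepancy from the two engines' log-probabilities. This is a minimal sketch, not the estimator used in the repo; the tensor names train_logprobs and infer_logprobs are assumptions.

import torch

def ratio_vs_tv_proxy(train_logprobs: torch.Tensor, infer_logprobs: torch.Tensor):
    """Compare PPO's probability ratio with a TV-style discrepancy, per sampled token.

    Both inputs are log-probabilities of the sampled tokens under the training
    and inference engines, shape (batch, seq_len). Illustrative proxy only.
    """
    # PPO-style ratio: exp(log pi_train - log pi_infer). For low-probability tokens,
    # a tiny absolute mismatch in probability becomes a huge relative one.
    ratio = torch.exp(train_logprobs - infer_logprobs)

    # TV-style discrepancy: absolute difference of the probabilities themselves.
    # Bounded in [0, 1], so low-probability tokens contribute very little.
    tv_proxy = (train_logprobs.exp() - infer_logprobs.exp()).abs()

    return ratio, tv_proxy

# Toy example: a token whose probability is 1e-6 under training but 1e-5 under inference.
train_lp = torch.tensor([[-13.8155]])   # log(1e-6)
infer_lp = torch.tensor([[-11.5129]])   # log(1e-5)
ratio, tv = ratio_vs_tv_proxy(train_lp, infer_lp)
print(ratio)  # ~0.1, i.e. a 10x relative deviation
print(tv)     # ~9e-6, negligible in absolute probability mass

In this toy case the ratio deviates by a factor of ten while the absolute probability difference is only about 9e-6, which is why the right-hand panel of Figure 1 stays flat where the left-hand panel spikes.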

Figure 2: Comparison of PPO and the proposed DPPO (the Binary-TV variant). (Left) The surrogate objective and corresponding masks for PPO and DPPO. PPO (and variants like GRPO) employs a heuristic mask based on the probability ratio. In contrast, DPPO uses a more principled mask based on a direct approximation of policy divergence (e.g., Total Variation), ensuring updates stay within a theoretically grounded trust region. (Right) Experimental results on AIME24 using Qwen3-30B-A3B-Base. DPPO significantly outperforms GRPO baselines, achieving superior training efficiency and stability even without rollout routing replay (R3).

Figure 3: DPPO variants achieve stable training while controlling the training-inference mismatch at a low level. In contrast, methods without a trust region (PG-IS, CISPO) or with a misspecified one (MiniRL) suffer from growing mismatch and eventual collapse.

Code

The code change for DPPO-Binary-TV/DPPO-Binary-KL is minimal; see core_algo.py for implementation details.
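For a sense of the idea before opening core_algo.py, here is a minimal sketch of a Binary-TV style surrogate loss, assuming per-token log-probabilities and advantages. The function name, threshold value, and divergence estimate below are illustrative assumptions, not the repo's actual API.

import torch

def dppo_binary_mask_loss(logprobs, old_logprobs, advantages, tv_threshold=0.2):
    """Sketch of a Binary-TV style surrogate loss (illustrative only).

    logprobs, old_logprobs, advantages: (batch, seq_len) tensors for sampled tokens.
    """
    # Importance ratio between the current and behavior policies.
    ratio = torch.exp(logprobs - old_logprobs)

    # Per-token TV-style divergence estimate, measured on the sampled token's probability.
    tv_estimate = (logprobs.exp() - old_logprobs.exp()).abs()

    # Binary trust-region mask: keep the gradient only where the estimated divergence
    # is small; tokens outside the trust region are dropped, not ratio-clipped.
    mask = (tv_estimate <= tv_threshold).float()

    # Importance-weighted policy-gradient surrogate, gated by the mask.
    surrogate = ratio * advantages
    loss = -(surrogate * mask).sum() / mask.sum().clamp(min=1.0)
    return loss

The design difference from PPO is that tokens whose divergence estimate exceeds the threshold are excluded from the loss entirely (a binary mask on an explicit divergence measure), rather than having their probability ratio clipped, which is the heuristic Figure 2 contrasts against.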

This repo also shows an example of how to implement the TopK divergence approximation.
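An exact TV (or KL) divergence between next-token distributions needs probabilities over the full vocabulary, which an inference engine may not return. The sketch below shows one way such a quantity could be approximated from only the top-k probabilities; all names and the tail bound here are illustrative assumptions, not the repo's implementation.

import torch

def topk_tv_approx(train_logits, topk_ids, topk_infer_probs):
    """Approximate per-position TV(pi_train, pi_infer) from top-k inference probabilities.

    train_logits:      (batch, seq, vocab) logits from the training engine.
    topk_ids:          (batch, seq, k) token ids reported by the inference engine.
    topk_infer_probs:  (batch, seq, k) their probabilities under the inference engine.
    """
    train_probs = torch.softmax(train_logits, dim=-1)
    train_topk = train_probs.gather(-1, topk_ids)  # (batch, seq, k)

    # Exact contribution of the top-k tokens to TV = 0.5 * sum |p_i - q_i|.
    head = 0.5 * (train_topk - topk_infer_probs).abs().sum(dim=-1)

    # The tokens outside the top-k set contribute at most half the sum of the two
    # leftover tail masses (a crude upper bound on their part of the TV distance).
    tail = 0.5 * ((1.0 - train_topk.sum(dim=-1)) + (1.0 - topk_infer_probs.sum(dim=-1)))

    return head + tail  # (batch, seq) approximate (upper-bounded) TV per position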

Citation

If you find our work useful for your research, please consider citing:

@article{qi2026stablerl,
  title={Rethinking the Trust Region in LLM Reinforcement Learning},
  author={Qi, Penghui and Zhou, Xiangxin and Liu, Zichen and Pang, Tianyu and Du, Chao and Lin, Min and Lee, Wee Sun},
  journal={arXiv preprint arXiv:2602.04879},
  year={2026}
}
