Figure 1: The plots show numerical differences between a training engine and an inference engine for Qwen3-30B-A3B-Base with identical parameters. (Left) The probability ratio (used in PPO) is highly volatile for low-probability tokens. (Right) In contrast, the TV divergence is more stable. This highlights a key flaw of PPO's clipping mechanism: it over-penalizes low-probability tokens, which can slow down learning, while under-penalizing high-probability tokens, which can permit large, destabilizing updates.
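To make the contrast in Figure 1 concrete, here is a toy calculation (the numbers are illustrative and not taken from the plots): a tiny absolute mismatch on a low-probability token yields an extreme probability ratio, while its contribution to the TV divergence stays negligible; the opposite holds for a high-probability token.

```python
# Toy numbers (illustrative, not read off the plots in Figure 1).
p_infer_low, p_train_low = 1e-5, 3e-5      # low-probability token
p_infer_high, p_train_high = 0.90, 0.93    # high-probability token

# PPO-style probability ratio: explodes for the low-probability token.
ratio_low = p_train_low / p_infer_low       # 3.0, far outside a clip range such as [0.8, 1.2]
ratio_high = p_train_high / p_infer_high    # ~1.03, passes the clip check

# Per-token contribution to the TV divergence: negligible for the
# low-probability token, larger for the token that actually moved more.
tv_low = 0.5 * abs(p_train_low - p_infer_low)     # 1e-05
tv_high = 0.5 * abs(p_train_high - p_infer_high)  # 0.015

print(ratio_low, ratio_high, tv_low, tv_high)
```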
Figure 2: Comparison of PPO and the proposed DPPO (the Binary-TV variant). (Left) The surrogate objective and corresponding masks for PPO and DPPO. PPO (and variants like GRPO) employs a heuristic mask based on the probability ratio. In contrast, DPPO utilizes a more principled mask based on a direct approximation of policy divergence (e.g., Total Variation), ensuring updates stay within a theoretically grounded trust region. (Right) Experimental results on AIME24 using Qwen3-30B-A3B-Base. DPPO significantly outperforms GRPO baselines, achieving superior training efficiency and stability even without rollout routing replay (R3).
Figure 3: DPPO variants achieve stable training while keeping the training-inference mismatch low. In contrast, methods without a trust region (PG-IS, CISPO) or with a misspecified one (MiniRL) suffer from growing mismatch and eventual collapse.
The code change for DPPO-Binary-TV/DPPO-Binary-KL is very simple; see core_algo.py for implementation details.
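For orientation, below is a minimal sketch of what a Binary-TV style loss could look like, assuming a per-token TV estimate gated by a threshold `delta`; the function name, tensor names, and threshold are illustrative assumptions, and core_algo.py remains the reference implementation.

```python
import torch

def dppo_binary_tv_loss(log_probs, old_log_probs, advantages, delta=0.2):
    """Hedged sketch of a Binary-TV style loss, NOT the core_algo.py code.

    The idea: replace PPO's ratio clipping with a binary mask that drops
    tokens whose estimated divergence from the behavior (inference) policy
    exceeds a trust-region threshold `delta` (name and value assumed here).
    A Binary-KL variant would threshold a per-token KL estimate instead.
    """
    # Per-token TV estimate between current and behavior token probabilities.
    tv = 0.5 * (log_probs.exp() - old_log_probs.exp()).abs()

    # Binary trust-region mask: 1 inside the region, 0 outside.
    mask = (tv <= delta).float()

    # Importance-sampled policy-gradient surrogate, gated by the mask
    # instead of being clipped on the probability ratio.
    ratio = (log_probs - old_log_probs).exp()
    return -(mask * ratio * advantages).mean()
```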
This repo provides an example of how to implement the TopK divergence approximation.
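As context for reading that example, here is one way such a top-k approximation could be sketched; the function name, the choice of `k`, and the restriction to 1-D distributions are assumptions for illustration, not the repo's API.

```python
import torch

def topk_tv_approx(p, q, k=64):
    """Sketch: approximate TV(p, q) = 0.5 * sum_i |p_i - q_i| using only the
    tokens in the union of the top-k sets of p and q (1-D distributions).

    Dropping the remaining (non-negative) terms gives a cheap lower-bound
    style estimate; this is illustrative, not the repo's implementation.
    """
    idx = torch.unique(torch.cat([p.topk(k).indices, q.topk(k).indices]))
    return 0.5 * (p[idx] - q[idx]).abs().sum()

# Hypothetical usage: p and q would be the token distributions produced by
# the training and inference engines at one position.
p = torch.softmax(torch.randn(32000), dim=-1)
q = torch.softmax(torch.randn(32000), dim=-1)
print(topk_tv_approx(p, q))
```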
If you find our work useful for your research, please consider citing:
@article{qi2026stablerl,
  title={Rethinking the Trust Region in LLM Reinforcement Learning},
  author={Qi, Penghui and Zhou, Xiangxin and Liu, Zichen and Pang, Tianyu and Du, Chao and Lin, Min and Lee, Wee Sun},
  journal={arXiv preprint arXiv:2602.04879},
  year={2026}
}