
Commit f085c0f

Merge branch 'main' into post-thumbnails
2 parents 3c1c5b4 + 39f972b commit f085c0f

File tree

1 file changed: +2 −1 lines changed


_posts/2025-11-10-bitwise-consistent-train-inference.md

Lines changed: 2 additions & 1 deletion
@@ -2,13 +2,14 @@
 layout: post
 title: "No More Train-Inference Mismatch: Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan"
 author: "vLLM and TorchTitan Teams"
+image: /assets/figures/2025-11-10-bitwise-exact-rl/reward-comparison.png
 ---
 
 We demonstrate an open-source bitwise consistent on-policy RL run with [TorchTitan](https://github.com/pytorch/torchtitan) as the training engine and [vLLM](https://github.com/vllm-project/vllm) as the inference engine. Built on top of [vLLM's recent work on batch-invariant inference](https://docs.vllm.ai/en/latest/features/batch_invariance/), we show how to run an RL fine-tune of Qwen3 1.7B with bitwise matching training and inference numerics in [our open-sourced instructions](https://github.com/pytorch/torchtitan/tree/main/torchtitan/experiments/deterministic_vllm_rl):
 
 ![](/assets/figures/2025-11-10-bitwise-exact-rl/rl-script-demo.png)
 
-Reinforcement learning has been shown to amplify tiny numerical mismatches between trainer and sampler, leading to non-deterministic and unstable training behavior ([He et al.](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/) & [Yao, Liu et al.](https://fengyao.notion.site/off-policy-rl)). We verified the impact of numerics on RL results with our results: Running the sampler with different kernels than the trainer (`batch_inv_OFF`) shows a reduced reward over 100 steps. Enabling bitwise exact training (`batch_inv_ON`, where `kl_div` always equals to 0.0), we see the model not only train in fewer steps, but reach a higher total reward.
+Reinforcement learning has been shown to amplify tiny numerical mismatches between trainer and sampler, leading to non-deterministic and unstable training behavior ([He et al.](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/), [Yao, Liu et al.](https://fengyao.notion.site/off-policy-rl) & [Liu, Li et al.](https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda)). We verified the impact of numerics on RL results with our results: Running the sampler with different kernels than the trainer (`batch_inv_OFF`) shows a reduced reward over 100 steps. Enabling bitwise exact training (`batch_inv_ON`, where `kl_div` always equals to 0.0), we see the model not only train in fewer steps, but reach a higher total reward.
 
 ![](/assets/figures/2025-11-10-bitwise-exact-rl/reward-comparison.png)
 
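The updated paragraph above notes that with batch-invariant kernels enabled, the trainer/sampler `kl_div` is always exactly 0.0. As a minimal sketch of that property (not part of this commit; the function and tensor names are hypothetical), one could compare the per-token log-probabilities recomputed by the trainer against those reported by the sampler:

```python
import torch

def check_bitwise_consistency(trainer_logprobs: torch.Tensor,
                              sampler_logprobs: torch.Tensor) -> float:
    """Compare per-token log-probabilities computed by the trainer and the
    sampler for the same sampled tokens."""
    # Bitwise equality: every element must match exactly, not merely within
    # a floating-point tolerance such as torch.allclose.
    assert torch.equal(trainer_logprobs, sampler_logprobs), \
        "trainer/sampler numerics diverge"

    # Naive per-token KL estimate between sampler and trainer policies,
    # restricted to the sampled tokens; it is exactly 0.0 when the
    # log-probabilities are bitwise identical.
    return (sampler_logprobs - trainer_logprobs).mean().item()

# Hypothetical usage: log-probabilities gathered from the vLLM rollout and
# recomputed by the TorchTitan trainer for the same token ids.
# assert check_bitwise_consistency(trainer_lp, sampler_lp) == 0.0
```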

0 commit comments