470 changes: 239 additions & 231 deletions docs/advance/fully_async.md

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion tests/special_e2e/run_fully_async_policy.sh
@@ -15,7 +15,7 @@ MODEL_PATH=${MODEL_PATH:-${HOME}/models/${MODEL_ID}}


rollout_mode="async"
rollout_name="sglang" # sglang or vllm
rollout_name="vllm" # sglang or vllm
if [ "$rollout_mode" = "async" ]; then
export VLLM_USE_V1=1
return_raw_chat="True"
@@ -123,6 +123,7 @@ common_params=(
trainer.resume_mode=disable
trainer.nnodes=1
trainer.n_gpus_per_node=${n_gpus_training}
+trainer.log_val_generations=10
rollout.nnodes=1
rollout.n_gpus_per_node=${n_gpus_rollout}
rollout.total_rollout_steps=${total_rollout_steps}
93 changes: 48 additions & 45 deletions verl/experimental/fully_async_policy/README.md
@@ -2,7 +2,7 @@

**Author:** `https://github.com/meituan-search`

-Last updated: 12/25/2025.
+Last updated: 02/05/2026.

This document introduces a fully asynchronous PPO training system that completely decouples the Trainer and Rollouter,
supporting asynchronous sample generation and training.
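
As a mental model, the decoupling can be pictured as a producer/consumer pair connected by a sample queue. The sketch below is illustrative only; the class and function names are invented and do not match verl's actual API:

```python
# Conceptual sketch only: names are invented, not verl's API.
# A Rollouter streams samples into a queue; the Trainer consumes them and
# periodically pushes fresh parameters back to the Rollouter.
import queue

sample_queue: "queue.Queue[dict]" = queue.Queue()

def rollouter_loop(total_rollout_steps: int) -> None:
    for step in range(total_rollout_steps):
        sample = {"prompt_id": step, "tokens": [], "rollout_log_probs": []}
        sample_queue.put(sample)  # streaming production, one sample at a time

def trainer_loop(ppo_mini_batch_size: int, trigger_parameter_sync_step: int) -> None:
    local_updates = 0
    while True:
        batch = [sample_queue.get() for _ in range(ppo_mini_batch_size)]
        # ... run one PPO update on `batch` ...
        local_updates += 1
        if local_updates % trigger_parameter_sync_step == 0:
            pass  # parameter synchronization with the Rollouter happens here
```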
@@ -88,27 +88,27 @@ https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_rev

### Parameter Description

| parameter | description |
|-----------|-------------|
| `trainer.nnodes` | Number of nodes for the Trainer |
| `trainer.n_gpus_per_node` | Number of GPUs per node for the Trainer |
| `rollout.nnodes` | Number of nodes for the Rollouter |
| `rollout.n_gpus_per_node` | Number of GPUs per node for the Rollouter |
| `data.train_batch_size` | Not effective in the fully async strategy (default is 0) |
| `data.gen_batch_size` | In the fully async strategy, samples are produced with streaming logic (default is 1) |
| `rollout.total_rollout_steps` | Total number of rollout samples |
| `rollout.test_freq` | Number of Rollouter parameter updates between validations |
| `actor_rollout_ref.actor.ppo_mini_batch_size` | The ppo_mini_batch_size is a global number across all workers/GPUs |
| `actor_rollout_ref.actor.use_rollout_log_probs=True` | Use the log_probs generated by rollout |
| `algorithm.rollout_correction.bypass_mode` | Whether to bypass recomputing log_prob with the training model's parameters during the training phase (default `True`) |
| `async_training.require_batches` | Number of ppo_mini_batch_size batches that the FullyAsyncTrainer fetches at once |
| `async_training.trigger_parameter_sync_step` | Number of local updates the FullyAsyncTrainer performs before each parameter synchronization |
| `async_training.staleness_threshold` | Freshness control |
| `async_training.partial_rollout` | Whether to perform partial_rollout |
| `async_training.checkpoint_engine.enable` | Whether to use checkpoint_engine for acceleration (default `True`) |
| `async_training.checkpoint_engine.overlap_broadcast_and_consume` | When using checkpoint_engine, whether to overlap broadcast and load_weights (default `False`) |
| `async_training.checkpoint_engine.device_buffer_size_M` | When using checkpoint_engine, the user-specified bucket size in MB (default `4096`) |
| `async_training.use_trainer_do_validate` | Whether to run validation on the trainer nodes (default `False`) |
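
For intuition, the checkpoint_engine options above describe a bucketed weight broadcast: parameters are packed into buckets of roughly `device_buffer_size_M` MB, and with `overlap_broadcast_and_consume` the broadcast of the next bucket can overlap with loading the previous one. A minimal sketch of that scheme, with invented helper names (not verl's checkpoint_engine internals):

```python
# Sketch only: invented helper names, not verl's checkpoint_engine internals.
# Weights are packed into ~device_buffer_size_M MB buckets; with overlap
# enabled, bucket i+1 is broadcast while bucket i is still being loaded.
from concurrent.futures import ThreadPoolExecutor

def sync_weights(named_tensors, broadcast, load_weights, bucket_mb: int = 4096):
    buckets, current, size_mb = [], [], 0.0
    for name, tensor in named_tensors:  # torch-like tensors assumed
        current.append((name, tensor))
        size_mb += tensor.numel() * tensor.element_size() / 2**20
        if size_mb >= bucket_mb:
            buckets.append(current)
            current, size_mb = [], 0.0
    if current:
        buckets.append(current)

    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = None
        for bucket in buckets:
            broadcast(bucket)            # send this bucket to the rollout workers
            if pending is not None:
                pending.result()         # finish loading the previous bucket
            pending = loader.submit(load_weights, bucket)  # overlaps next broadcast
        if pending is not None:
            pending.result()
```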

**Further Explanation:**

@@ -151,14 +151,6 @@ https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_rev

`partial_rollout` only takes effect when `staleness_threshold > 0`.
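
A rough sketch of the partial_rollout idea, with hypothetical names (the actual interruption and resumption logic lives in the rollout engine). Because the finished sample can mix parameter versions, this is also why the rollout-side log_probs and correction discussed below matter:

```python
# Sketch of partial_rollout (hypothetical names): when a parameter sync
# interrupts generation, the partial response is kept and decoding resumes
# under the new weights instead of being discarded.
def generate_with_partial_rollout(prompt, engine, max_new_tokens: int, sync_requested):
    tokens = []
    while len(tokens) < max_new_tokens:
        if sync_requested():
            engine.update_weights()  # the sample now mixes parameter versions
        tokens.append(engine.decode_one(prompt, tokens))
        if tokens[-1] == engine.eos_token_id:
            break
    return tokens
```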

* `async_training.require_batches`

In streaming training, require_batches should be set to 1, indicating that training is performed after producing one ppo_mini_batch_size of samples.
@@ -168,14 +160,25 @@ https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_rev
Here, we additionally provide require_batches for streaming distribution, controlling the number of samples
that participate in training at once, as sketched below.
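
A minimal sketch of the fetch logic implied here (hypothetical names; the real FullyAsyncTrainer pulls from its internal sample stream):

```python
# Sketch only: how require_batches scales a single fetch from the sample
# stream. `sample_queue` stands in for the trainer's streaming sample source.
def fetch_training_samples(sample_queue, ppo_mini_batch_size: int, require_batches: int = 1):
    needed = require_batches * ppo_mini_batch_size
    return [sample_queue.get() for _ in range(needed)]
```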

* `actor_rollout_ref.actor.use_rollout_log_probs=True`

In reinforcement learning algorithms, log_probs are implicitly tied to the parameter version that produced the
tokens. Given how PPO/GRPO/DAPO compute importance sampling, old_log_prob must be the log_probs recorded under
the rollout parameters for the sampled tokens to ensure algorithm correctness. In the fully async strategy, we
therefore default to computing old_log_prob in the Rollouter rather than in the Trainer.
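
As a small illustration (assumed per-token tensors; not verl's actual loss code), the importance ratio is formed against the rollout-recorded log_probs:

```python
# Illustration only (not verl's loss code): the importance ratio must be
# formed against the log_probs recorded by the rollout engine, because those
# match the parameter version that actually sampled the tokens.
import torch

def importance_ratio(current_log_probs: torch.Tensor,
                     rollout_log_probs: torch.Tensor) -> torch.Tensor:
    # With use_rollout_log_probs=True, old_log_prob == rollout_log_probs.
    return torch.exp(current_log_probs - rollout_log_probs)
```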

* `algorithm.rollout_correction.bypass_mode`

> `algorithm.rollout_correction.bypass_mode` defaults to `True`, i.e., the rollout log_probs are used directly.

During training, we observed that metrics and response lengths may become unstable in the later stages of
training. To mitigate this, we can apply the
[Rollout Importance Sampling](https://verl.readthedocs.io/en/latest/advance/rollout_is.html) technique.
Using Rollout Importance Sampling requires computing log_prob with the training engine, i.e., setting
`algorithm.rollout_correction.bypass_mode=False`.
Additionally, when `algorithm.rollout_correction.bypass_mode=False` and Rollout Importance Sampling are enabled
under mode d (async stream pipeline with partial rollout), our implementation approximates `Areal's Decoupled PPO`.
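
For reference, a rough sketch of the decoupled-PPO-style objective this approximates, with three per-token log_prob sources (rollout/behavior, prox, and current). This is an illustration under those assumptions, not verl's implementation:

```python
# Rough sketch of a decoupled-PPO-style objective (AReaL-like), assuming three
# per-token log_prob tensors; illustration only, not verl's implementation.
import torch

def decoupled_ppo_loss(logp_current: torch.Tensor,  # training model, current step
                       logp_prox: torch.Tensor,     # training model at sync point (bypass_mode=False)
                       logp_rollout: torch.Tensor,  # recorded by the rollout engine
                       advantages: torch.Tensor,
                       clip_eps: float = 0.2) -> torch.Tensor:
    # Behavior correction between the prox policy and the (possibly stale,
    # possibly mixed-version) rollout policy that produced the tokens.
    behav_weight = torch.exp(logp_prox - logp_rollout).detach()
    # Clipped PPO term anchored at the prox policy rather than the rollout policy.
    ratio = torch.exp(logp_current - logp_prox)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -(behav_weight * torch.minimum(unclipped, clipped)).mean()
```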

* `async_training.checkpoint_engine.enable`
@@ -332,7 +335,6 @@ python -m recipe.fully_async_policy.fully_async_main \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.name=${rollout_name} \
actor_rollout_ref.rollout.mode=${rollout_mode} \
-actor_rollout_ref.rollout.calculate_log_probs=True \
trainer.nnodes="${NNODES_TRAIN}" \
trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
rollout.nnodes="${NNODES_ROLLOUT}" \
@@ -473,14 +475,15 @@ future will be our next focus.

### checkpoint-engine Ablation Experiment
We tested the single-step parameter synchronization time of the checkpoint-engine on three models: Qwen2.5-Math-7B, Qwen3-30B-A3B, and Qwen3-235B-A22B, using default checkpoint-engine configurations. All experiments were performed on H20 machines, and the Megatron engine was used for training.

| model | trainer rank | rollout rank | checkpoint-engine | total sync time |
|:---------------:|:--------------:|:-------------:|:-------------------:|:-----------------:|
| Qwen2.5-Math-7B | 4 | 4 | False | 0.12s |
| Qwen2.5-Math-7B | 4 | 4 | True | 0.02s |
| Qwen3-30B-A3B | 16 | 16 | False | 15.76s |
| Qwen3-30B-A3B | 16 | 16 | True | 4.38s |
| Qwen3-235B-A22B | 64 | 64 | False | 58.57s |
| Qwen3-235B-A22B | 64 | 64 | True | 23.70s |


### use_trainer_do_validate Experiment
@@ -505,10 +508,10 @@ We used Qwen2.5-Math-7B to verify the benefits of `use_trainer_do_validate=True`
* staleness_threshold: 0.5
* partial_rollout: True

| training mode | resource allocation | step | gen | old_log_prob | update_actor | validate time | total time<br>50 step | acc/mean@2 |
|:------------------:|:-------------------:|:-------:|:-------:|:------------:|:------------:|:-------------:|:---------------------:|:----------:|
| colocate sync | 16 | 484.623 | 52.939 | 0 | 430.263 | 205.080 | 7h9m | 22.6 |
| fully_async_policy | 8:8 | 489.953 | 52.622 | 0 | 435.874 | 95.699 | 7h2m | 21.0 |


## Multi-Turn Tool Calling