470 changes: 239 additions & 231 deletions docs/advance/fully_async.md

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion tests/special_e2e/run_fully_async_policy.sh
@@ -15,7 +15,7 @@ MODEL_PATH=${MODEL_PATH:-${HOME}/models/${MODEL_ID}}


rollout_mode="async"
rollout_name="sglang" # sglang or vllm
rollout_name="vllm" # sglang or vllm
if [ "$rollout_mode" = "async" ]; then
export VLLM_USE_V1=1
return_raw_chat="True"
@@ -123,6 +123,7 @@ common_params=(
trainer.resume_mode=disable
trainer.nnodes=1
trainer.n_gpus_per_node=${n_gpus_training}
+trainer.log_val_generations=10
rollout.nnodes=1
rollout.n_gpus_per_node=${n_gpus_rollout}
rollout.total_rollout_steps=${total_rollout_steps}
93 changes: 48 additions & 45 deletions verl/experimental/fully_async_policy/README.md
@@ -2,7 +2,7 @@

**Author:** `https://github.com/meituan-search`

-Last updated: 12/25/2025.
+Last updated: 02/05/2026.

This document introduces a fully asynchronous PPO training system that completely decouples the Trainer and Rollouter,
supporting asynchronous sample generation and training.
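
As a mental model, the decoupling can be pictured as a producer/consumer pair connected by a sample queue. The sketch below is illustrative only; the class and function names are invented and do not match verl's actual API:

```python
# Conceptual sketch only: names are invented, not verl's API.
# A Rollouter streams samples into a queue; the Trainer consumes them and
# periodically pushes fresh parameters back to the Rollouter.
import queue

sample_queue: "queue.Queue[dict]" = queue.Queue()

def rollouter_loop(total_rollout_steps: int) -> None:
    for step in range(total_rollout_steps):
        sample = {"prompt_id": step, "tokens": [], "rollout_log_probs": []}
        sample_queue.put(sample)  # streaming production, one sample at a time

def trainer_loop(ppo_mini_batch_size: int, trigger_parameter_sync_step: int) -> None:
    local_updates = 0
    while True:
        batch = [sample_queue.get() for _ in range(ppo_mini_batch_size)]
        # ... run one PPO update on `batch` ...
        local_updates += 1
        if local_updates % trigger_parameter_sync_step == 0:
            pass  # parameter synchronization with the Rollouter happens here
```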
@@ -88,27 +88,27 @@ https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_rev

### Parameter Description

| parameter | description |
|-----------|-------------|
| `trainer.nnodes` | Number of nodes for the Trainer |
| `trainer.n_gpus_per_node` | Number of GPUs per node for the Trainer |
| `rollout.nnodes` | Number of nodes for the Rollouter |
| `rollout.n_gpus_per_node` | Number of GPUs per node for the Rollouter |
| `data.train_batch_size` | Not effective in the fully async strategy (default is 0) |
| `data.gen_batch_size` | In the fully async strategy, samples are produced with streaming logic (default is 1) |
| `rollout.total_rollout_steps` | Total number of rollout samples |
| `rollout.test_freq` | Number of Rollouter parameter updates between validations |
| `actor_rollout_ref.actor.ppo_mini_batch_size` | The ppo_mini_batch_size is a global number across all workers/GPUs |
| `actor_rollout_ref.actor.use_rollout_log_probs=True` | Use the log_probs generated by rollout |
| `algorithm.rollout_correction.bypass_mode` | Whether to bypass recomputing log_prob with the training model's parameters during the training phase (default `True`) |
| `async_training.require_batches` | Number of ppo_mini_batch_size batches that the FullyAsyncTrainer fetches at once |
| `async_training.trigger_parameter_sync_step` | Number of local updates the FullyAsyncTrainer performs before each parameter synchronization |
| `async_training.staleness_threshold` | Freshness control |
| `async_training.partial_rollout` | Whether to perform partial_rollout |
| `async_training.checkpoint_engine.enable` | Whether to use checkpoint_engine for acceleration (default `True`) |
| `async_training.checkpoint_engine.overlap_broadcast_and_consume` | When using checkpoint_engine, whether to overlap broadcast and load_weights (default `False`) |
| `async_training.checkpoint_engine.device_buffer_size_M` | When using checkpoint_engine, the user-specified bucket size in MB (default `4096`) |
| `async_training.use_trainer_do_validate` | Whether to run validation on the trainer nodes (default `False`) |
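
For intuition, the checkpoint_engine options above describe a bucketed weight broadcast: parameters are packed into buckets of roughly `device_buffer_size_M` MB, and with `overlap_broadcast_and_consume` the broadcast of the next bucket can overlap with loading the previous one. A minimal sketch of that scheme, with invented helper names (not verl's checkpoint_engine internals):

```python
# Sketch only: invented helper names, not verl's checkpoint_engine internals.
# Weights are packed into ~device_buffer_size_M MB buckets; with overlap
# enabled, bucket i+1 is broadcast while bucket i is still being loaded.
from concurrent.futures import ThreadPoolExecutor

def sync_weights(named_tensors, broadcast, load_weights, bucket_mb: int = 4096):
    buckets, current, size_mb = [], [], 0.0
    for name, tensor in named_tensors:  # torch-like tensors assumed
        current.append((name, tensor))
        size_mb += tensor.numel() * tensor.element_size() / 2**20
        if size_mb >= bucket_mb:
            buckets.append(current)
            current, size_mb = [], 0.0
    if current:
        buckets.append(current)

    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = None
        for bucket in buckets:
            broadcast(bucket)            # send this bucket to the rollout workers
            if pending is not None:
                pending.result()         # finish loading the previous bucket
            pending = loader.submit(load_weights, bucket)  # overlaps next broadcast
        if pending is not None:
            pending.result()
```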

**Further Explanation:**

@@ -151,14 +151,6 @@ https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_rev

`partial_rollout` only takes effect when `staleness_threshold > 0`.
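
A rough sketch of the partial_rollout idea, with hypothetical names (the actual interruption and resumption logic lives in the rollout engine). Because the finished sample can mix parameter versions, this is also why the rollout-side log_probs and correction discussed below matter:

```python
# Sketch of partial_rollout (hypothetical names): when a parameter sync
# interrupts generation, the partial response is kept and decoding resumes
# under the new weights instead of being discarded.
def generate_with_partial_rollout(prompt, engine, max_new_tokens: int, sync_requested):
    tokens = []
    while len(tokens) < max_new_tokens:
        if sync_requested():
            engine.update_weights()  # the sample now mixes parameter versions
        tokens.append(engine.decode_one(prompt, tokens))
        if tokens[-1] == engine.eos_token_id:
            break
    return tokens
```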

* `async_training.require_batches`

In streaming training, require_batches should be set to 1, indicating that training is performed after producing one ppo_mini_batch_size of samples.
@@ -168,14 +160,25 @@ https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_rev
Here, we additionally provide require_batches for streaming distribution, controlling the number of samples
that participate in training at once, as sketched below.
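
A minimal sketch of the fetch logic implied here (hypothetical names; the real FullyAsyncTrainer pulls from its internal sample stream):

```python
# Sketch only: how require_batches scales a single fetch from the sample
# stream. `sample_queue` stands in for the trainer's streaming sample source.
def fetch_training_samples(sample_queue, ppo_mini_batch_size: int, require_batches: int = 1):
    needed = require_batches * ppo_mini_batch_size
    return [sample_queue.get() for _ in range(needed)]
```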

* `actor_rollout_ref.actor.use_rollout_log_probs=True`

In reinforcement learning algorithms, log_probs are implicitly tied to the parameter version that produced the
tokens. Given how PPO/GRPO/DAPO compute importance sampling, old_log_prob must be the log_probs recorded under
the rollout parameters for the sampled tokens to ensure algorithm correctness. In the fully async strategy, we
therefore default to computing old_log_prob in the Rollouter rather than in the Trainer.
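
As a small illustration (assumed per-token tensors; not verl's actual loss code), the importance ratio is formed against the rollout-recorded log_probs:

```python
# Illustration only (not verl's loss code): the importance ratio must be
# formed against the log_probs recorded by the rollout engine, because those
# match the parameter version that actually sampled the tokens.
import torch

def importance_ratio(current_log_probs: torch.Tensor,
                     rollout_log_probs: torch.Tensor) -> torch.Tensor:
    # With use_rollout_log_probs=True, old_log_prob == rollout_log_probs.
    return torch.exp(current_log_probs - rollout_log_probs)
```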

* `algorithm.rollout_correction.bypass_mode`

> `algorithm.rollout_correction.bypass_mode` defaults to `True`, i.e., the rollout log_probs are used directly.

During training, we observed that metrics and response lengths may become unstable in the later stages of
training. To mitigate this, we can apply the
[Rollout Importance Sampling](https://verl.readthedocs.io/en/latest/advance/rollout_is.html) technique.
Using Rollout Importance Sampling requires computing log_prob with the training engine, i.e., setting
`algorithm.rollout_correction.bypass_mode=False`.
Additionally, when `algorithm.rollout_correction.bypass_mode=False` and Rollout Importance Sampling are enabled
under mode d (async stream pipeline with partial rollout), our implementation approximates `Areal's Decoupled PPO`.
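
For reference, a rough sketch of the decoupled-PPO-style objective this approximates, with three per-token log_prob sources (rollout/behavior, prox, and current). This is an illustration under those assumptions, not verl's implementation:

```python
# Rough sketch of a decoupled-PPO-style objective (AReaL-like), assuming three
# per-token log_prob tensors; illustration only, not verl's implementation.
import torch

def decoupled_ppo_loss(logp_current: torch.Tensor,  # training model, current step
                       logp_prox: torch.Tensor,     # training model at sync point (bypass_mode=False)
                       logp_rollout: torch.Tensor,  # recorded by the rollout engine
                       advantages: torch.Tensor,
                       clip_eps: float = 0.2) -> torch.Tensor:
    # Behavior correction between the prox policy and the (possibly stale,
    # possibly mixed-version) rollout policy that produced the tokens.
    behav_weight = torch.exp(logp_prox - logp_rollout).detach()
    # Clipped PPO term anchored at the prox policy rather than the rollout policy.
    ratio = torch.exp(logp_current - logp_prox)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -(behav_weight * torch.minimum(unclipped, clipped)).mean()
```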

* `async_training.checkpoint_engine.enable`
@@ -332,7 +335,6 @@ python -m recipe.fully_async_policy.fully_async_main \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.name=${rollout_name} \
actor_rollout_ref.rollout.mode=${rollout_mode} \
-actor_rollout_ref.rollout.calculate_log_probs=True \
trainer.nnodes="${NNODES_TRAIN}" \
trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
rollout.nnodes="${NNODES_ROLLOUT}" \
@@ -473,14 +475,15 @@ future will be our next focus.

### checkpoint-engine Ablation Experiment
We tested the single-step parameter synchronization time of the checkpoint-engine on three models: Qwen2.5-Math-7B, Qwen3-30B-A3B, and Qwen3-235B-A22B, using default checkpoint-engine configurations. All experiments were performed on H20 machines, and the Megatron engine was used for training.

| model | trainer rank | rollout rank | checkpoint-engine | total sync time |
|:---------------:|:--------------:|:-------------:|:-------------------:|:-----------------:|
| Qwen2.5-Math-7B | 4 | 4 | False | 0.12s |
| Qwen2.5-Math-7B | 4 | 4 | True | 0.02s |
| Qwen3-30B-A3B | 16 | 16 | False | 15.76s |
| Qwen3-30B-A3B | 16 | 16 | True | 4.38s |
| Qwen3-235B-A22B | 64 | 64 | False | 58.57s |
| Qwen3-235B-A22B | 64 | 64 | True | 23.70s |


### use_trainer_do_validate Experiment
@@ -505,10 +508,10 @@ We used Qwen2.5-Math-7B to verify the benefits of `use_trainer_do_validate=True`
* staleness_threshold: 0.5
* partial_rollout: True

| training mode | resource allocation | step | gen | old_log_prob | update_actor | validate time | total time<br>50 step | acc/mean@2 |
|:------------------:|:-------------------:|:-------:|:-------:|:------------:|:------------:|:-------------:|:---------------------:|:----------:|
| colocate sync | 16 | 484.623 | 52.939 | 0 | 430.263 | 205.080 | 7h9m | 22.6 |
| fully_async_policy | 8:8 | 489.953 | 52.622 | 0 | 435.874 | 95.699 | 7h2m | 21.0 |


## Multi-Turn Tool Calling