🚀 Enhance GRPO VLLM server from sync to async and accelerate training #3182
base: main
Conversation
@binary-husky the speedups you posted look great, though I have a question on how you parallelize the computation. The picture shows a data dependency between rollouts and model training (and the vllm update). In other words, this achieves parallelization within grad accum steps, and works only if grad accum > 1?
@fabianlim Yes, it works only if the grad accum step is > 1.
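To make the dependency concrete, here is a rough sketch of the intended overlap (the `client`/`trainer` objects and method names are illustrative placeholders, not the PR's actual API). With grad accum == 1 there is no "next" micro-batch to prefetch before the weights change, so nothing can overlap.

```python
def train_one_optimizer_step(micro_batches, client, trainer):
    """Overlap vllm generation for micro-batch i+1 with training on micro-batch i."""
    # Kick off generation for the first micro-batch (non-blocking).
    client.generate(micro_batches[0])
    for i, batch in enumerate(micro_batches):
        # Block until the rollouts for the current micro-batch are ready.
        completions = client.get_future(batch)
        # Start generating the next micro-batch before the backward pass runs.
        if i + 1 < len(micro_batches):
            client.generate(micro_batches[i + 1])
        trainer.accumulate_gradients(batch, completions)
    trainer.optimizer_step()   # weights only change after the last micro-batch,
    client.sync_weights()      # so the vllm server is updated once per optimizer step
```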
We should make this happen! The new Mistral reasoning model uses a pipeline like this one.
Shouldn't populating
- client first calls `generate` (non-blocking), then after a while calls `get_future` (with identical arguments) to get the result
- client automatically calls `get_future` inside `generate`, blocking further execution before the generation is complete

Generation is requested for the next `gradient_accumulation_steps` batches ahead of time, so that training and vllm generation can run in parallel! However, I have to admit that this piece of code is not elegant enough; remove it if it disqualifies.
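To make the generate / get_future protocol concrete, here is a minimal, self-contained sketch (not the PR's code; the class name, the thread-pool stand-in for the vllm server, and the argument-keying scheme are all illustrative assumptions):

```python
import json
from concurrent.futures import Future, ThreadPoolExecutor

class AsyncGenerationClient:
    """Hypothetical illustration of the protocol described above: `generate`
    submits work and can return immediately; a later `get_future` call with
    identical arguments blocks until the result is ready."""

    def __init__(self, backend):
        self._backend = backend  # callable: prompts -> completions (stands in for the vllm server)
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._futures: dict[str, Future] = {}

    def _key(self, prompts, **kwargs) -> str:
        # Identify a request by its arguments so a later get_future call
        # with identical arguments can find the pending result.
        return json.dumps([prompts, kwargs], sort_keys=True)

    def generate(self, prompts, blocking=True, **kwargs):
        key = self._key(prompts, **kwargs)
        if key not in self._futures:
            self._futures[key] = self._pool.submit(self._backend, prompts, **kwargs)
        if blocking:
            # Backward-compatible mode: get_future is called inside generate,
            # blocking until the generation is complete.
            return self.get_future(prompts, **kwargs)
        return None  # non-blocking: the caller comes back later with get_future

    def get_future(self, prompts, **kwargs):
        key = self._key(prompts, **kwargs)
        return self._futures.pop(key).result()  # blocks until generation finishes
```

The key point is that `generate(..., blocking=False)` returns immediately, so the trainer can run the backward pass while the server is still generating the next batch.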
There is also a `RolloutEngine` in `trl.scripts.vllm_serve`, for more sophisticated vllm inference functionality, trying to support `lm_generate > MCP tool_call > lm_generate > another MCP tool_call > ...`, but it is not complete yet.
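A rough, hypothetical sketch of the multi-turn rollout that pattern implies: generate, detect a tool call, execute it, append the tool result, and generate again until the model stops calling tools. The injected callables (`lm_generate`, `parse_tool_call`, `call_mcp_tool`) and the `<tool_result>` formatting are placeholders, not real TRL/vllm APIs.

```python
from typing import Callable, Optional

def rollout(
    prompt: str,
    lm_generate: Callable[[str], str],                 # wraps one vllm generation call
    parse_tool_call: Callable[[str], Optional[dict]],  # extracts an MCP tool call, if any
    call_mcp_tool: Callable[[dict], str],              # executes the tool and returns its output
    max_turns: int = 4,
) -> str:
    """Interleave model generation with MCP tool calls until the model stops
    requesting tools or the turn budget is exhausted."""
    transcript = prompt
    for _ in range(max_turns):
        completion = lm_generate(transcript)
        transcript += completion
        tool_call = parse_tool_call(completion)
        if tool_call is None:
            break  # no tool requested: the model produced a final answer
        tool_result = call_mcp_tool(tool_call)
        transcript += f"\n<tool_result>{tool_result}</tool_result>\n"
    return transcript
```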