
Conversation

@knlnguyen1802
Contributor

@knlnguyen1802 knlnguyen1802 commented Dec 19, 2025

Purpose

Fixes #342 and partially addresses #316.
This is the diffusion-engine version, continuing #355.


cc @ZJY0516, author of the Diffusion Engine

Signed-off-by: knlnguyen1802 <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines +252 to +255
try:
    result = self.execute_rpc(msg)
    if result is not None and self.gpu_id == 0:
        self.return_result(result)

P1: collective_rpc waits for replies other ranks never send

collective_rpc expects a reply from each worker unless unique_reply_rank is set, but in worker_busy_loop only rank 0 enqueues RPC responses (self.gpu_id == 0 gate) and other ranks drop their results because they lack a result queue. On multi-GPU runs any RPC targeting a non-zero rank or broadcast calls with unique_reply_rank=None will block/time out waiting for responses that are never sent.
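
For context, a drop-in sketch of the shape of the fix for the block quoted above (the unique_reply_rank field and the error-reply convention are assumptions based on this thread, not the actual vllm-omni code):

try:
    result = self.execute_rpc(msg)
    reply_rank = getattr(msg, "unique_reply_rank", None)  # assumed field
    if reply_rank is None or reply_rank == self.gpu_id:
        # Every targeted rank replies, so collective_rpc receives exactly
        # the number of responses it is waiting for.
        self.return_result(result)
except Exception as e:
    # Errors must also be returned, or the caller blocks forever.
    self.return_result({"status": "error", "error": str(e)})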

Signed-off-by: knlnguyen1802 <[email protected]>
@SamitHuang
Collaborator

SamitHuang commented Dec 22, 2025

How did you test it? Maybe try running generate, sleep/wake_up, and load_weight with RPC on multiple GPUs/NPUs?

@knlnguyen1802
Contributor Author

How did you test it? Maybe try running generate, sleep/wake_up, and load_weight with RPC on multiple GPUs/NPUs?

I tried adding it to an e2e example and testing it.


def add_req(self, requests: list[OmniDiffusionRequest]) -> DiffusionOutput:
    """Sends a request to the scheduler and waits for the response."""
    """Sends a generation request via RPC to worker rank 0 and waits for the response."""
Collaborator

Could you explain why we send to rank 0 rather than broadcasting to all workers?

Contributor Author

The docstring was mistaken when I first wrote it; I misunderstood how the Scheduler works. Let me fix it.
It should broadcast to all workers.
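
For illustration, a minimal sketch of the broadcasting version (the RPCMessage shape, the queues, and SchedulerSketch are assumptions made for this sketch, not the real vllm-omni types):

import multiprocessing as mp
from dataclasses import dataclass

@dataclass
class RPCMessage:  # illustrative stand-in for the real message type
    method: str
    args: tuple = ()
    unique_reply_rank: int | None = 0  # rank expected to enqueue the result

class SchedulerSketch:
    def __init__(self, num_workers: int):
        # One input queue per worker process; a single shared result queue.
        self.worker_input_queues = [mp.Queue() for _ in range(num_workers)]
        self.result_queue = mp.Queue()

    def add_req(self, requests: list) -> dict:
        """Broadcasts a generation request to all workers and waits for the
        reply from the designated rank."""
        msg = RPCMessage(method="generate", args=(requests,))
        for q in self.worker_input_queues:  # broadcast: every TP rank must step
            q.put(msg)
        return self.result_queue.get()  # only the reply rank enqueues a result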

Signed-off-by: knlnguyen1802 <[email protected]>
@knlnguyen1802
Contributor Author

How did you test it? Maybe try running generate, sleep/wake_up, and load_weight with RPC on multiple GPUs/NPUs?

Added an e2e test for rpc_collective.
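
As a rough outline, such a test might exercise the paths SamitHuang listed; the make_engine fixture and the collective_rpc signature below are assumptions for illustration, not the actual vllm-omni API:

def test_rpc_collective_multi_gpu(make_engine):  # make_engine: assumed fixture
    engine = make_engine(tensor_parallel_size=2)
    # Broadcast path: every rank should reply.
    assert len(engine.collective_rpc("sleep")) == 2
    assert len(engine.collective_rpc("wake_up")) == 2
    # Single-reply path: only the designated rank returns a result.
    out = engine.collective_rpc("generate", unique_reply_rank=0)
    assert out is not None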

Signed-off-by: knlnguyen1802 <[email protected]>
@ZJY0516
Collaborator

ZJY0516 commented Dec 22, 2025

I have a question: why do we need this method? If we use it for inter-process communication, note that the diffusion engine and the scheduler reside in the same process.

@knlnguyen1802
Contributor Author

knlnguyen1802 commented Dec 22, 2025

I have a question: why do we need this method? If we use it for inter-process communication, note that the diffusion engine and the scheduler reside in the same process.

I need it to call functions from PR #376 in offline mode; the final purpose is to support the verl integration.
Yes, the engine and scheduler share a process, but the workers run in separate processes, and this is also one way to communicate with them.

@ZJY0516
Collaborator

ZJY0516 commented Dec 22, 2025

I have a question: why do we need this method? If we use it for inter-process communication, note that the diffusion engine and the scheduler reside in the same process.

I need it to call functions from PR #376 in offline mode; the final purpose is to support the verl integration. Yes, the engine and scheduler share a process, but the workers run in separate processes, and this is also one way to communicate with them.

Makes sense.

@ZJY0516
Collaborator

ZJY0516 commented Dec 22, 2025

@knlnguyen1802 please resolve the conflicts

Signed-off-by: knlnguyen1802 <[email protected]>
return {"status": "error", "error": str(e)}

# TODO: queueing, cancellation
def worker_busy_loop(self) -> None:
Collaborator

I think we should redesign this, because it has become more complex.

We can address the redesign in a separate follow-up task if you don't have time.

here is a good example of worker_busy_loop: https://github.com/vllm-project/vllm/blob/c02a2705f9ceeb00b5d32453621f997b2ceafbea/vllm/v1/executor/multiproc_executor.py#L806
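
For reference, a condensed sketch of the pattern in the linked vLLM loop: one dequeue/dispatch/reply cycle per iteration, with exceptions funneled into the same reply path so callers never block forever (the queue and field names here are illustrative, not the real vllm-omni attributes):

def worker_busy_loop(self) -> None:
    while True:
        method, args, kwargs, reply_rank = self.rpc_in_queue.get()
        try:
            func = getattr(self, method) if isinstance(method, str) else method
            output, status = func(*args, **kwargs), "success"
        except Exception as e:
            output, status = str(e), "failure"
        if reply_rank is None or reply_rank == self.rank:
            self.rpc_out_queue.put((status, output))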

Contributor Author

Agreed. The redesign is WIP and will need a more structured RFC.

Collaborator

Just confirming — is this already WIP?

Contributor Author

@knlnguyen1802 knlnguyen1802 Dec 22, 2025

This PR is not WIP, but the redesign you mentioned above is in progress.

Contributor Author

@ZJY0516 It's ready now. Could you take another look? Thanks!

Signed-off-by: knlnguyen1802 <[email protected]>
@@ -0,0 +1,46 @@
import os
Collaborator

@ZJY0516 ZJY0516 Dec 22, 2025

The test is a little strange here. I don't think we need an e2e test here; we can test it after #376 lands. cc @SamitHuang

Collaborator

@SamitHuang SamitHuang Dec 23, 2025

Yes, tests for this PR can be covered by the tests in #376.

Collaborator

@knlnguyen1802 Could you please remove this file? Once that's done, we can merge this PR.

@ZJY0516 ZJY0516 added the ready label (to trigger buildkite CI) Dec 22, 2025
@hsliuustc0106
Collaborator

Please fix CI, @knlnguyen1802.

Development

Successfully merging this pull request may close these issues.

[RFC]: Support more functionality other than generate

4 participants