
Conversation

@knlnguyen1802
Contributor

@knlnguyen1802 knlnguyen1802 commented Dec 19, 2025

Purpose

Fixes #342 and partially addresses #316.
This is the diffusion-engine version, continuing #355.


cc @ZJY0516, author of the Diffusion Engine

Signed-off-by: knlnguyen1802 <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines +252 to +255
try:
    result = self.execute_rpc(msg)
    if result is not None and self.gpu_id == 0:
        self.return_result(result)

P1: collective_rpc waits for replies other ranks never send

collective_rpc expects a reply from each worker unless unique_reply_rank is set, but in worker_busy_loop only rank 0 enqueues RPC responses (self.gpu_id == 0 gate) and other ranks drop their results because they lack a result queue. On multi-GPU runs any RPC targeting a non-zero rank or broadcast calls with unique_reply_rank=None will block/time out waiting for responses that are never sent.
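
For context, a drop-in sketch of the shape of the fix for the block quoted above (the unique_reply_rank field and the error-reply convention are assumptions based on this thread, not the actual vllm-omni code):

try:
    result = self.execute_rpc(msg)
    reply_rank = getattr(msg, "unique_reply_rank", None)  # assumed field
    if reply_rank is None or reply_rank == self.gpu_id:
        # Every targeted rank replies, so collective_rpc receives exactly
        # the number of responses it is waiting for.
        self.return_result(result)
except Exception as e:
    # Errors must also be returned, or the caller blocks forever.
    self.return_result({"status": "error", "error": str(e)})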

Signed-off-by: knlnguyen1802 <[email protected]>
@SamitHuang
Collaborator

SamitHuang commented Dec 22, 2025

How did you test it? Maybe try running generate, sleep/wake_up, and load_weight with RPC on multiple GPUs/NPUs?

@knlnguyen1802
Contributor Author

How did you test it? Maybe try running generate, sleep/wake_up, and load_weight with RPC on multiple GPUs/NPUs?

I tried adding it to an e2e example and testing it.


def add_req(self, requests: list[OmniDiffusionRequest]) -> DiffusionOutput:
    """Sends a request to the scheduler and waits for the response."""
    """Sends a generation request via RPC to worker rank 0 and waits for the response."""
Collaborator

Could you explain why we send to rank 0 rather than broadcasting to all workers?

Contributor Author

The docstring was mistaken when I first wrote it; I misunderstood how the Scheduler works. Let me fix it.
It should broadcast to all workers.
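
For illustration, a minimal sketch of the broadcasting version (the RPCMessage shape, the queues, and SchedulerSketch are assumptions made for this sketch, not the real vllm-omni types):

import multiprocessing as mp
from dataclasses import dataclass

@dataclass
class RPCMessage:  # illustrative stand-in for the real message type
    method: str
    args: tuple = ()
    unique_reply_rank: int | None = 0  # rank expected to enqueue the result

class SchedulerSketch:
    def __init__(self, num_workers: int):
        # One input queue per worker process; a single shared result queue.
        self.worker_input_queues = [mp.Queue() for _ in range(num_workers)]
        self.result_queue = mp.Queue()

    def add_req(self, requests: list) -> dict:
        """Broadcasts a generation request to all workers and waits for the
        reply from the designated rank."""
        msg = RPCMessage(method="generate", args=(requests,))
        for q in self.worker_input_queues:  # broadcast: every TP rank must step
            q.put(msg)
        return self.result_queue.get()  # only the reply rank enqueues a result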

Signed-off-by: knlnguyen1802 <[email protected]>
@knlnguyen1802
Contributor Author

How did you test it? Maybe try running generate, sleep/wake_up, and load_weight with RPC on multiple GPUs/NPUs?

Added an e2e test for rpc_collective.
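
As a rough outline, such a test might exercise the paths SamitHuang listed; the make_engine fixture and the collective_rpc signature below are assumptions for illustration, not the actual vllm-omni API:

def test_rpc_collective_multi_gpu(make_engine):  # make_engine: assumed fixture
    engine = make_engine(tensor_parallel_size=2)
    # Broadcast path: every rank should reply.
    assert len(engine.collective_rpc("sleep")) == 2
    assert len(engine.collective_rpc("wake_up")) == 2
    # Single-reply path: only the designated rank returns a result.
    out = engine.collective_rpc("generate", unique_reply_rank=0)
    assert out is not None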

Signed-off-by: knlnguyen1802 <[email protected]>
@ZJY0516
Collaborator

ZJY0516 commented Dec 22, 2025

I have a question: why do we need this method? If we use it for inter-process communication, note that the diffusion engine and the scheduler reside in the same process.

@knlnguyen1802
Contributor Author

knlnguyen1802 commented Dec 22, 2025

I have a question: why do we need this method? If we use it for inter-process communication, note that the diffusion engine and the scheduler reside in the same process.

I need it to call functions from PR #376 in offline mode; the final purpose is to support the verl integration.
Yes, the engine and scheduler share a process, but the workers run in separate processes, and this is also one way to communicate with them.

@ZJY0516
Collaborator

ZJY0516 commented Dec 22, 2025

I have a question: why do we need this method? If we use it for inter-process communication, note that the diffusion engine and the scheduler reside in the same process.

I need it to call functions from PR #376 in offline mode; the final purpose is to support the verl integration. Yes, the engine and scheduler share a process, but the workers run in separate processes, and this is also one way to communicate with them.

Makes sense.

@ZJY0516
Collaborator

ZJY0516 commented Dec 22, 2025

@knlnguyen1802 please resolve the conflicts

Signed-off-by: knlnguyen1802 <[email protected]>
return {"status": "error", "error": str(e)}

# TODO: queueing, cancellation
def worker_busy_loop(self) -> None:
Collaborator

I think we should redesign this, because it has become more complex.

We can address the redesign in a separate follow-up task if you don't have time.

here is a good example of worker_busy_loop: https://github.com/vllm-project/vllm/blob/c02a2705f9ceeb00b5d32453621f997b2ceafbea/vllm/v1/executor/multiproc_executor.py#L806
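
For reference, a condensed sketch of the pattern in the linked vLLM loop: one dequeue/dispatch/reply cycle per iteration, with exceptions funneled into the same reply path so callers never block forever (the queue and field names here are illustrative, not the real vllm-omni attributes):

def worker_busy_loop(self) -> None:
    while True:
        method, args, kwargs, reply_rank = self.rpc_in_queue.get()
        try:
            func = getattr(self, method) if isinstance(method, str) else method
            output, status = func(*args, **kwargs), "success"
        except Exception as e:
            output, status = str(e), "failure"
        if reply_rank is None or reply_rank == self.rank:
            self.rpc_out_queue.put((status, output))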

Contributor Author

Agreed. The redesign is WIP and will need a more structured RFC.

Collaborator

Just confirming — is this already WIP?

Contributor Author

@knlnguyen1802 knlnguyen1802 Dec 22, 2025

This PR is not WIP, but the redesign you mentioned above is in progress.

Contributor Author

@ZJY0516 It's ready now. Could you take another look? Thanks!

Signed-off-by: knlnguyen1802 <[email protected]>
@@ -0,0 +1,46 @@
import os
Collaborator

@ZJY0516 ZJY0516 Dec 22, 2025

The test is a little strange here. I don't think we need an e2e test here; we can test it after #376 lands. cc @SamitHuang

Collaborator

@SamitHuang SamitHuang Dec 23, 2025

Yes, tests for this PR can be covered by the tests in #376.

Collaborator

@knlnguyen1802 Could you please remove this file? Once that's done, we can merge this PR.

@ZJY0516 ZJY0516 added the ready label (to trigger buildkite CI) Dec 22, 2025
@hsliuustc0106
Collaborator

Please fix CI, @knlnguyen1802.

Development

Successfully merging this pull request may close these issues.

[RFC]: Support more functionality other than generate

4 participants