RPC support for OmniDiffusion #371
Conversation
Signed-off-by: knlnguyen1802 <[email protected]>
Signed-off-by: knlnguyen1802 <[email protected]>
Signed-off-by: knlnguyen1802 <[email protected]>
Signed-off-by: knlnguyen1802 <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
try:
    result = self.execute_rpc(msg)
    if result is not None and self.gpu_id == 0:
        self.return_result(result)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
collective_rpc waits for replies other ranks never send
collective_rpc expects a reply from each worker unless unique_reply_rank is set, but in worker_busy_loop only rank 0 enqueues RPC responses (the self.gpu_id == 0 gate); the other ranks drop their results because they lack a result queue. On multi-GPU runs, any RPC targeting a non-zero rank, or a broadcast call with unique_reply_rank=None, will block or time out waiting for responses that are never sent.
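A minimal sketch of the kind of change this points at, assuming every rank has (or is given) a result queue to reply into. execute_rpc, return_result, gpu_id, and unique_reply_rank come from the discussion above; the per-rank input queue and the attribute lookup on msg are assumptions for illustration, not this repo's confirmed API.

def worker_busy_loop(self) -> None:
    while True:
        msg = self.rpc_queue.get()  # assumed per-rank input queue
        try:
            result = self.execute_rpc(msg)
            reply_rank = getattr(msg, "unique_reply_rank", None)
            # Reply from every rank unless the caller asked for a single rank,
            # so collective_rpc receives as many responses as it expects.
            if result is not None and (reply_rank is None or reply_rank == self.gpu_id):
                self.return_result(result)
        except Exception as e:
            # Surface failures instead of silently dropping them.
            self.return_result({"status": "error", "error": str(e)})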
Signed-off-by: knlnguyen1802 <[email protected]>
How did you test it? Maybe try running generate, sleep/wake_up, and load_weight via RPC on multiple GPUs/NPUs?
I tried adding it to the e2e example and testing it there.
vllm_omni/diffusion/scheduler.py
Outdated
def add_req(self, requests: list[OmniDiffusionRequest]) -> DiffusionOutput:
-    """Sends a request to the scheduler and waits for the response."""
+    """Sends a generation request via RPC to worker rank 0 and waits for the response."""
Could you explain why we send to rank 0 rather than broadcasting to all workers?
The docstring was wrong when I first wrote it; I misunderstood how the Scheduler works. Let me fix it.
It should broadcast to all workers.
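For reference, a hedged sketch of what the corrected add_req could look like once it broadcasts instead of targeting rank 0. The collective_rpc call, its signature, and the worker-side handler name are assumptions borrowed from this thread, not the repo's confirmed API.

def add_req(self, requests: list[OmniDiffusionRequest]) -> DiffusionOutput:
    """Broadcasts a generation request to all workers and waits for the output."""
    # Every rank executes the request; only one rank's output needs to reach the caller.
    outputs = self.collective_rpc("add_req", args=(requests,))
    return outputs[0]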
Signed-off-by: knlnguyen1802 <[email protected]>
Signed-off-by: knlnguyen1802 <[email protected]>
Signed-off-by: knlnguyen1802 <[email protected]>
Signed-off-by: knlnguyen1802 <[email protected]>
Signed-off-by: knlnguyen1802 <[email protected]>
Added an e2e test for rpc_collective.
Signed-off-by: knlnguyen1802 <[email protected]>
I have a question: why do we need this method? If it's for inter-process communication, the diffusion engine and the scheduler reside in the same process.
I need it to call functions from PR #376 in offline mode.
Makes sense.
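As a purely hypothetical illustration of that offline-mode use case: the RPC path lets an offline script reach worker-side methods such as the sleep/wake_up calls mentioned earlier. The class name, constructor arguments, and registered RPC method names below are assumptions, not confirmed by this PR or #376.

# Hypothetical offline usage; names and signatures are assumptions for illustration only.
engine = OmniDiffusionEngine(model="some/diffusion-model")
engine.collective_rpc("sleep")    # e.g. offload weights while idle (see #376)
engine.collective_rpc("wake_up")  # restore them before the next request
outputs = engine.generate(["a photo of a cat"])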
@knlnguyen1802 please resolve the conflicts
Signed-off-by: knlnguyen1802 <[email protected]>
return {"status": "error", "error": str(e)}

# TODO: queueing, cancellation
def worker_busy_loop(self) -> None:
I think we should redesign this because it has become more complex.
We can address the redesign in a separate follow-up task if you don't have time.
Here is a good example of a worker_busy_loop: https://github.com/vllm-project/vllm/blob/c02a2705f9ceeb00b5d32453621f997b2ceafbea/vllm/v1/executor/multiproc_executor.py#L806
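Sketched loosely on that linked example, the redesign could dequeue a tagged request, dispatch by method name, and reply with an explicit success flag. The queue names and message layout below are assumptions to illustrate the direction, not the repo's or vLLM's exact API.

def worker_busy_loop(self) -> None:
    while True:
        method, args, kwargs, output_rank = self.rpc_broadcast_queue.get()
        try:
            output = (True, getattr(self, method)(*args, **kwargs))  # dispatch by name
        except Exception as e:
            output = (False, str(e))  # report the failure instead of hanging the caller
        if output_rank is None or output_rank == self.rank:
            self.response_queue.put(output)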
Agreed; the redesign is WIP and will need a more structured RFC.
Just confirming — is this already WIP?
This PR is not WIP, but the redesign you mentioned above is in progress.
@ZJY0516 It's ready now. Could you take another look? Thanks.
Signed-off-by: knlnguyen1802 <[email protected]>
tests/e2e/test_rpc_collective.py
Outdated
@@ -0,0 +1,46 @@
import os
The test is a little strange here. I don't think we need an e2e test here; we can test it after #376 lands. cc @SamitHuang
Yes, the tests for this PR can be covered by the tests in #376.
@knlnguyen1802 Could you please remove this file? Once that's done, we can merge this PR.
Signed-off-by: Jiangyun Zhu <[email protected]>
Fix CI please @knlnguyen1802
Purpose
Fixes #342 and partially addresses #316.
This is the diffusion-engine version, continuing #355.
cc @ZJY0516 as the author of the Diffusion Engine.