RPC support for OmniLLM #355
base: main
Conversation
Signed-off-by: knlnguyen1802 <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm_omni/entrypoints/omni_stage.py
Outdated
```python
_logging.getLogger(__name__).debug(
    "[Stage-%s] Executing collective_rpc: method=%s", stage_id, method
)
result = stage_engine.collective_rpc(method, timeout, args, kwargs)
out_q.put({
```
Worker RPC handler calls missing method
The collective_rpc path in the worker invokes stage_engine.collective_rpc(...), but the stage engine is an OmniStageLLM (created at line 500) and there is no collective_rpc implementation anywhere in that class or the rest of the repo. Every collective_rpc task will therefore raise an AttributeError and be returned as an error response, so the new API never actually executes on the worker.
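A minimal sketch of what the missing method could look like, assuming OmniStageLLM wraps a vLLM engine whose LLM class already exposes collective_rpc; the `_llm` attribute and constructor shown here are assumptions, not code from this PR:

```python
# Hypothetical sketch: OmniStageLLM delegates collective_rpc to the wrapped
# vLLM engine. How OmniStageLLM actually holds its engine is not shown in
# the PR, so `_llm` is an assumed attribute.
class OmniStageLLM:
    def __init__(self, llm):
        self._llm = llm  # assumed: a vllm.LLM (or compatible) engine handle

    def collective_rpc(self, method, timeout=None, args=(), kwargs=None):
        # Broadcast the named method to all workers via the wrapped engine.
        return self._llm.collective_rpc(
            method, timeout=timeout, args=args, kwargs=kwargs or {}
        )
```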
vllm_omni/entrypoints/omni_stage.py
Outdated
```python
result = self.try_collect()
if result is not None and result.get("type") == "collective_rpc_result" and result.get("rpc_id") == rpc_id:
    if "error" in result:
        raise RuntimeError(f"collective_rpc failed: {result['error']}")
    return result["result"]
```
collective_rpc drops unrelated out-queue messages
While waiting for a matching collective_rpc_result, this loop consumes any item from self._out_q via try_collect() and discards it if the type/rpc_id do not match. That permanently removes normal stage outputs that arrive while the RPC is pending, causing those requests to hang because their responses are lost. The code should buffer or requeue non-RPC messages instead of dropping them.
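A hedged sketch of the suggested fix: buffer unrelated messages while waiting for the matching RPC reply, then hand them back to the queue. try_collect(), _out_q, and the message schema come from the diff above; the helper itself is an assumption about how a fix could look:

```python
import time

def wait_for_rpc_result(stage, rpc_id, timeout):
    """Wait for a matching collective_rpc_result without dropping other messages.

    `stage` is assumed to expose try_collect() and _out_q as in the diff above;
    this helper is a hypothetical fix, not code from the PR.
    """
    pending = []  # unrelated stage outputs observed while waiting
    deadline = time.monotonic() + timeout
    try:
        while time.monotonic() < deadline:
            msg = stage.try_collect()
            if msg is None:
                continue
            if (msg.get("type") == "collective_rpc_result"
                    and msg.get("rpc_id") == rpc_id):
                if "error" in msg:
                    raise RuntimeError(f"collective_rpc failed: {msg['error']}")
                return msg["result"]
            pending.append(msg)  # buffer instead of discarding
        raise TimeoutError(f"collective_rpc {rpc_id} timed out after {timeout}s")
    finally:
        # Hand buffered messages back so normal collection still sees them.
        # Note: requeueing may reorder outputs relative to arrival order.
        for item in pending:
            stage._out_q.put(item)
```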
vllm_omni/entrypoints/omni_stage.py
Outdated
```python
assert self._in_q is not None and self._out_q is not None, "Queues must be attached before collective_rpc"

# Submit collective_rpc task to worker
import uuid
```
Why do we use a lazy import here?
Sorry, it should have been a draft [WIP], but thanks for the review. Will fix it.
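The likely fix is simply hoisting the import to module scope, sketched below; the helper name is hypothetical:

```python
# At the top of omni_stage.py, instead of importing inside collective_rpc:
import uuid

def new_rpc_id() -> str:
    # Generates a per-call identifier like the rpc_id matched in the
    # snippets above; this helper name is hypothetical.
    return uuid.uuid4().hex
```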
Signed-off-by: knlnguyen1802 <[email protected]>
@ZJY0516 Could you have a look again? Thanks!
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
try:
    result = stage_engine.collective_rpc(method, timeout, args, kwargs)
    out_q.put({
```
collective_rpc task calls missing engine method
The worker loop dispatches collective RPCs to stage_engine.collective_rpc(...), but OmniStageLLM (and the rest of the repo—checked with rg "collective_rpc") defines no such method. Any collective_rpc task will therefore raise an AttributeError in the worker, propagate back as an error reply, and cause OmniStage.collective_rpc to raise a RuntimeError, leaving the new API unusable.
```python
result = self.try_collect()
if result is not None:
    if result.get("type") == "collective_rpc_result":
        if result.get("rpc_id") == rpc_id:
            if "error" in result:
```
RPC wait loop discards non-RPC outputs
While waiting for a matching collective_rpc_result, the new collective_rpc method pops any item from self._out_q via try_collect() and ignores everything that is not the target RPC result. That silently drops unrelated stage outputs (e.g., pending inference responses or status messages), so issuing a collective RPC while other work is in-flight will lose those messages and leave callers hanging without responses.
Signed-off-by: knlnguyen1802 <[email protected]>
Purpose
Fixes #342 and partially fixes #316.
This adds support for calling sleep, wait, and load_weights on the AR ModelRunner.
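For illustration, assuming the collective_rpc(method, timeout, args, kwargs) signature visible in the diff, call sites might look like this; the `stage` handle, timeout values, and the load_weights kwargs are examples, not verified against the repo:

```python
# Illustrative calls against the new API; method names follow the PR
# description (sleep, wait, load_weights), everything else is assumed.
stage.collective_rpc("sleep", 30.0, (), {})
stage.collective_rpc("wait", 30.0, (), {})
stage.collective_rpc("load_weights", 300.0, (), {"path": "/tmp/ckpt"})
```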