[Feature] Support Qwen Omni online batch inference #438
base: main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
@Bounty-hunter PTAL
```python
    _gen_t1 = _time.time()
    _gen_ms = (_gen_t1 - _gen_t0) * 1000.0
    await generation_out_q.put((rid, gen_output, _gen_ms))
except Exception as e:
```
It seems that `_generation_tasks_by_rid[rid]` is not cleaned up when an exception occurs.
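A minimal sketch of one way to address this, assuming the worker wraps each request in `try/finally` (names mirror the PR; the generation call is a stand-in, not the actual engine API):

```python
import asyncio
import time

async def _run_one(rid, generation_out_q, _generation_tasks_by_rid):
    """Hypothetical per-request body: the real generation call is assumed."""
    try:
        _gen_t0 = time.time()
        gen_output = await asyncio.sleep(0.01, result=f"output-{rid}")  # stand-in for generation
        _gen_ms = (time.time() - _gen_t0) * 1000.0
        await generation_out_q.put((rid, gen_output, _gen_ms))
    finally:
        # Runs on success, exception, and cancellation, so the entry never leaks.
        _generation_tasks_by_rid.pop(rid, None)
```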
```python
try:
    generation_task = _generation_tasks_by_rid.pop(rid, None)
    if generation_task is None or not generation_task.done():
        raise asyncio.InvalidStateError(f"[Stage-{stage_id}] generation task failed for request: {rid}")
```
Is this check actually necessary here? Maybe we can use the following to automatically clean up `_generation_tasks_by_rid` for both normal and failed requests:

```python
task.add_done_callback(self._generation_tasks_by_rid.discard)
```
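Note that `discard` is a `set` method; since `_generation_tasks_by_rid` appears to be keyed by request id, a dict-based version of the same idea might look like the following sketch (the helper name and structure are assumptions, not the PR's code):

```python
import asyncio

_generation_tasks_by_rid: dict[str, asyncio.Task] = {}

def _track_generation(rid: str, coro) -> asyncio.Task:
    """Hypothetical helper: register a task and drop its entry when it finishes."""
    task = asyncio.ensure_future(coro)
    _generation_tasks_by_rid[rid] = task
    # The done-callback fires on normal completion, exception, and cancellation alike.
    task.add_done_callback(lambda _t: _generation_tasks_by_rid.pop(rid, None))
    return task
```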
Overall the changes look good.
```python
_batch_seq += 1
if _stats_file:
    _avg_tokens_per_s = (
        (_agg_total_tokens * 1000.0 / _agg_total_gen_time_ms) if _agg_total_gen_time_ms > 0 else 0.0
    )
```
In batch mode, `_agg_total_gen_time_ms` is overestimated, because the `_gen_ms` windows of individual requests may overlap.
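One way to avoid the double counting (a sketch under assumed names, not the PR's actual code) is to measure the batch's wall-clock window once instead of summing per-request durations:

```python
import time

_batch_t0 = time.time()
# ... all concurrent generations for this batch run here ...
_batch_wall_ms = (time.time() - _batch_t0) * 1000.0

# Throughput over the shared wall-clock window; overlapping requests are
# no longer counted twice. `_agg_total_tokens` is the summed token count
# from the diff above.
_avg_tokens_per_s = (_agg_total_tokens * 1000.0 / _batch_wall_ms) if _batch_wall_ms > 0 else 0.0
```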
fix pre-commit & DCO sign-off please
Add continuous batching support for omni online serving, ref #410
Purpose
Changes:
- Add `generation_single_request` in `_stage_worker_async` to handle single-request generation
- Add `generation_out_q` to collect request output
- Rename `test_video_to_audio` in e2e `test_qwen3_omni.py` to `test_video_to_audio_concurrent` to test batch generation (a minimal sketch of the queue-based flow follows this list)
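To illustrate the flow described above: each request is generated concurrently and its output is pushed onto `generation_out_q`, then collected in completion order. This is a self-contained sketch, not the PR's implementation; the generation call is a stand-in:

```python
import asyncio

async def generation_single_request(rid: str, generation_out_q: asyncio.Queue) -> None:
    """Hypothetical per-request worker: generate one request, push its output."""
    gen_output = await asyncio.sleep(0.01, result=f"output-for-{rid}")  # stand-in generation
    await generation_out_q.put((rid, gen_output))

async def main() -> None:
    generation_out_q: asyncio.Queue = asyncio.Queue()
    rids = [f"req-{i}" for i in range(3)]
    # Continuous batching: requests run concurrently; outputs arrive in
    # completion order rather than submission order.
    tasks = [asyncio.create_task(generation_single_request(r, generation_out_q)) for r in rids]
    for _ in rids:
        rid, out = await generation_out_q.get()
        print(rid, out)
    await asyncio.gather(*tasks)

asyncio.run(main())
```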
Test Plan

```bash
python -m pytest -s -v test_qwen3_omni.py
```

Test Result
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.