
Conversation


@ZeldaHuang ZeldaHuang commented Dec 23, 2025

Add continuous batching support for omni online serving, ref #410

Purpose

Changes:

  • Add an async function generation_single_request inside _stage_worker_async to handle generation for a single request (see the sketch after this list)
  • Add an async queue generation_out_q to collect request outputs
  • Rename test_video_to_audio in the e2e test test_qwen3_omni.py to test_video_to_audio_concurrent so it exercises concurrent batch generation
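
A minimal sketch of the per-request task plus output-queue pattern described above. This is illustrative only: the engine.generate call, the worker loop, and any names other than generation_single_request, generation_out_q, and _generation_tasks_by_rid are assumptions, not the actual vllm-omni code.

```python
import asyncio
import time

# Illustrative sketch only: engine.generate and the surrounding loop are
# assumed placeholders, not the actual vllm-omni implementation.
generation_out_q: asyncio.Queue = asyncio.Queue()
_generation_tasks_by_rid: dict = {}

async def generation_single_request(rid, request, engine):
    """Generate output for one request and push it onto the output queue."""
    _gen_t0 = time.time()
    gen_output = await engine.generate(request)  # hypothetical engine call
    _gen_ms = (time.time() - _gen_t0) * 1000.0
    await generation_out_q.put((rid, gen_output, _gen_ms))

async def _stage_worker_async(incoming_requests, engine):
    """Launch one task per request so generation is continuously batched."""
    for rid, request in incoming_requests:
        task = asyncio.create_task(generation_single_request(rid, request, engine))
        _generation_tasks_by_rid[rid] = task
    # Drain outputs in completion order rather than submission order.
    results = []
    for _ in range(len(incoming_requests)):
        results.append(await generation_out_q.get())
    return results
```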

Test Plan

python -m pytest -s -v test_qwen3_omni.py

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@ZeldaHuang ZeldaHuang marked this pull request as draft December 23, 2025 12:41
@ZeldaHuang ZeldaHuang marked this pull request as ready for review December 24, 2025 11:18

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@ZeldaHuang ZeldaHuang changed the title from "[WIP][Feature] Support Qwen Omni online batch inference" to "[Feature] Support Qwen Omni online batch inference" on Dec 24, 2025
@hsliuustc0106
Collaborator

@Bounty-hunter PTAL

    _gen_t1 = _time.time()
    _gen_ms = (_gen_t1 - _gen_t0) * 1000.0
    await generation_out_q.put((rid, gen_output, _gen_ms))
except Exception as e:
Contributor


It seems that _generation_tasks_by_rid[rid] is not cleaned up when an exception occurs.
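
One possible way to guarantee cleanup on both the success and the failure path is a try/except/finally around the per-request generation. This is only a sketch with assumed names (engine.generate, request), not the PR's actual code:

```python
try:
    _gen_t0 = time.time()
    gen_output = await engine.generate(request)  # hypothetical engine call
    _gen_ms = (time.time() - _gen_t0) * 1000.0
    await generation_out_q.put((rid, gen_output, _gen_ms))
except Exception as e:
    # Report the failure so downstream consumers are not left waiting.
    await generation_out_q.put((rid, e, 0.0))
finally:
    # Drop the bookkeeping entry whether the request succeeded or failed.
    _generation_tasks_by_rid.pop(rid, None)
```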

try:
    generation_task = _generation_tasks_by_rid.pop(rid, None)
    if generation_task is None or not generation_task.done():
        raise asyncio.InvalidStateError(f"[Stage-{stage_id}] generation task failed for request: {rid}")
Contributor


Is this check actually necessary here? Maybe we could use the following code to automatically clean up _generation_tasks_by_rid for both normal and failed requests:

task.add_done_callback(self._generation_tasks_by_rid.discard)
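
If _generation_tasks_by_rid is a dict keyed by request id (as the name suggests) rather than a set of tasks, the done callback would need to remove the entry by key. A possible sketch, with the task-creation line assumed for context:

```python
task = asyncio.create_task(generation_single_request(rid, request, engine))
_generation_tasks_by_rid[rid] = task
# add_done_callback fires on normal completion, exception, and cancellation,
# so the bookkeeping entry is always removed without an explicit check.
task.add_done_callback(lambda _t, rid=rid: _generation_tasks_by_rid.pop(rid, None))
```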

Contributor

@Bounty-hunter Bounty-hunter left a comment


Overall the changes look good.

_batch_seq += 1
if _stats_file:
    _avg_tokens_per_s = (
        (_agg_total_tokens * 1000.0 / _agg_total_gen_time_ms) if _agg_total_gen_time_ms > 0 else 0.0
Contributor

@Bounty-hunter Bounty-hunter Dec 24, 2025


In batch mode, _agg_total_gen_time_ms is overestimated because the _gen_ms intervals of individual requests may overlap.
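
One way to avoid the overestimate is to compute aggregate throughput against wall-clock batch time instead of the sum of per-request durations. A rough sketch with assumed variable names (_batch_t0, _batch_wall_ms):

```python
import time

_batch_t0 = time.time()
# ... launch all generation tasks and drain generation_out_q ...
_batch_wall_ms = (time.time() - _batch_t0) * 1000.0

# Per-request _gen_ms values overlap under concurrency, so keep them for
# per-request stats only and use wall-clock time for aggregate throughput.
_avg_tokens_per_s = (
    _agg_total_tokens * 1000.0 / _batch_wall_ms if _batch_wall_ms > 0 else 0.0
)
```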

@hsliuustc0106
Collaborator

Please fix the pre-commit checks and add the DCO sign-off.
