Fix async_scheduling + batched prefill #741
Conversation
Signed-off-by: Tianmu Li <[email protected]>
Pull request overview
This PR fixes issues with async scheduling when combined with batched prefill by properly handling incomplete prompts and refactoring state management for structured output.
Key Changes:
- Refactored `structured_output` from a local variable to an instance variable `use_structured_output` for better state management
- Added logic to ensure at least 1 logit is generated for incomplete prompts when using async scheduling or structured output (see the sketch below)
- Moved `invalid_req_indices` initialization and tracking to instance-level scope for proper lifecycle management
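A minimal sketch of the at-least-one-logit rule, assuming illustrative names: `num_logits_for_request` and its parameters are not from this PR, only the arithmetic mirrors the diff quoted further down.

```python
# Illustrative sketch, not the PR's actual code: clamp the logit count so an
# incomplete (chunked-prefill) prompt still yields at least one logit under
# async scheduling or structured output.
def num_logits_for_request(
    num_computed_tokens: int,
    num_scheduled_tokens: int,
    num_prompt_tokens: int,
    use_async_scheduling: bool,
    use_structured_output: bool,
) -> int:
    # Positions produced past the last prompt token this step;
    # <= 0 while the prompt is still mid-prefill.
    num_logits = num_computed_tokens + num_scheduled_tokens - num_prompt_tokens + 1
    if num_logits < 1 and (use_async_scheduling or use_structured_output):
        # Align with gpu_model_runner: emit one logit anyway so the request
        # can be tracked (e.g. via invalid_req_indices) and the placeholder
        # token discarded later.
        return 1
    return max(num_logits, 0)
```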
```python
self.use_async_scheduling = self.scheduler_config.async_scheduling
self.use_structured_output: bool = False # Default to false. Set to true when needed during a run
```
Copilot AI commented on Dec 18, 2025:
Capitalized 'false' and 'true' to match Python boolean literals 'False' and 'True'.
```diff
- self.use_structured_output: bool = False # Default to false. Set to true when needed during a run
+ self.use_structured_output: bool = False # Default to False. Set to True when needed during a run
```
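For context on why the flag lives on the instance: a decision made while preparing inputs must stay visible to later phases of the same run. A hedged sketch follows; everything except `use_structured_output` and `use_async_scheduling` is an invented name, not vLLM's API.

```python
from dataclasses import dataclass

@dataclass
class SchedulerConfigStub:
    # Stand-in for the real scheduler config; only the field we need.
    async_scheduling: bool = False

class RunnerSketch:
    def __init__(self, scheduler_config: SchedulerConfigStub) -> None:
        self.use_async_scheduling = scheduler_config.async_scheduling
        self.use_structured_output: bool = False  # Set per run when needed

    def prepare_inputs(self, has_grammar_requests: bool) -> None:
        # Recorded on the instance so sampling/bookkeeping later in the run
        # can read it, instead of threading a local through every call.
        self.use_structured_output = has_grammar_requests
```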
```python
# NOTE(tianmu-li): align behavior of incomplete prompt with gpu_model_runner
# Always have at least 1 logit when using async scheduling
# or structured output
if seq_num_computed_tokens + seq_num_scheduled_tokens - num_prompt_tokens + 1 < 1:
```
Copilot AI commented on Dec 18, 2025:
The condition `seq_num_computed_tokens + seq_num_scheduled_tokens - num_prompt_tokens + 1 < 1` is complex and unclear. Consider simplifying to `seq_num_computed_tokens + seq_num_scheduled_tokens < num_prompt_tokens` for better readability.
```diff
- if seq_num_computed_tokens + seq_num_scheduled_tokens - num_prompt_tokens + 1 < 1:
+ if seq_num_computed_tokens + seq_num_scheduled_tokens < num_prompt_tokens:
```
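The two conditions are indeed equivalent: subtracting 1 from both sides of `x + 1 < 1` leaves `x < 0`, i.e. `computed + scheduled < prompt`. A throwaway brute-force check confirms it:

```python
# Sanity check that the original and suggested conditions always agree.
for computed in range(5):
    for scheduled in range(5):
        for prompt in range(5):
            original = computed + scheduled - prompt + 1 < 1
            simplified = computed + scheduled < prompt
            assert original == simplified
```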
✅ CI Passed. All checks passed successfully against the following vllm commit:
Cherry-pick of #740