
Conversation

@haNa-meister

Motivation

In the last prefill-only change (PR), we decided to skip the decode scheduling stage to achieve higher throughput. However, that implementation also skips merging last_batch into running_batch, so running_batch is always empty.

Metrics

  • Fix the issue where sglang:num_running_reqs is always 0.0.

Safety mechanisms

  • Because running_batch is always empty in this scenario, running_lens is always 0. As a result, many of the scheduler's safety mechanisms are never enabled, for example: link, link
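The bullet above can be illustrated with a minimal sketch. The guard below is hypothetical (the name `check_memory_pressure` and the threshold are illustrative, not SGLang's actual code); it only shows why a scheduler check keyed on the running-batch length can never fire while running_batch stays empty:

```python
# Hypothetical guard modeled on a scheduler safety check that is
# gated on the running-batch length. Names and threshold are
# illustrative assumptions, not SGLang's actual implementation.
def check_memory_pressure(running_batch_reqs, max_running_reqs=64):
    running_lens = len(running_batch_reqs)
    # If running_batch is always empty, running_lens is always 0,
    # so this guard can never trigger even under heavy load.
    if running_lens >= max_running_reqs:
        raise RuntimeError("too many running requests")
    return running_lens

# Before the fix, the scheduler effectively always saw this case:
assert check_memory_pressure([]) == 0
```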

Modifications

  • Move the skip-decode logic for prefill-only requests into the run-decode branch.
  • Filter running_batch in each scheduling loop so finished requests are not tracked indefinitely.
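The two modifications above can be sketched as follows. This is a simplified model of the scheduling loop, not the actual SGLang scheduler: the class names mirror SGLang's (`ScheduleBatch`, `running_batch`) but the bodies are illustrative assumptions.

```python
# Simplified model of the scheduling-loop change described above.
# Not the real SGLang scheduler; names mirror it for readability.

class Req:
    def __init__(self, finished=False, prefill_only=True):
        self.finished = finished
        self.prefill_only = prefill_only

class ScheduleBatch:
    def __init__(self, reqs=None):
        self.reqs = list(reqs or [])
    def is_empty(self):
        return not self.reqs
    def merge(self, other):
        self.reqs.extend(other.reqs)
    def filter_finished(self):
        # Drop finished requests each loop so the batch stays accurate.
        self.reqs = [r for r in self.reqs if not r.finished]

def scheduling_step(running_batch, last_batch):
    # Before this PR, a prefill-only last_batch was never merged,
    # so running_batch stayed empty forever.
    if last_batch is not None:
        running_batch.merge(last_batch)

    # New: filter finished requests on every scheduling loop.
    running_batch.filter_finished()

    num_running_reqs = len(running_batch.reqs)  # the gauge's source

    # The skip-decode decision now lives inside the decode branch:
    if not running_batch.is_empty():
        if all(r.prefill_only for r in running_batch.reqs):
            pass  # prefill-only: skip the decode forward pass
        else:
            pass  # run_decode(running_batch) would go here
    return num_running_reqs
```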

Accuracy Tests

Test env:
GPU: H100.
Model: Qwen3-0.6B.

Running on this PR

curl -X POST "http://localhost:8080/v1/score"   -H "Content-Type: application/json"   -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753]
  }'
{"scores":[[0.00014663469283989953,6.92653243690209e-05],[0.00016726256415974512,5.4302204151145536e-05],[0.0002321074899909774,5.179018141333686e-05]], ...

Running on last version

curl -X POST "http://localhost:8080/v1/score"   -H "Content-Type: application/json"   -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753]
  }'
{"scores":[[0.00014663469283989953,6.92653243690209e-05],[0.00016726256415974512,5.4302204151145536e-05],[0.0002321074899909774,5.179018141333686e-05]],"model":"...,"usage":null,"object":"scoring"}
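The identical score arrays above confirm the fix does not change results. As a side note, the raw per-item probabilities can be normalized into a Yes-vs-No preference. This assumes the two `label_token_ids` correspond to "Yes" and "No" tokens in Qwen3's tokenizer, in that order, which is an assumption, not something stated in this PR:

```python
# Post-processing sketch for the /v1/score response shown above.
# Assumption: label_token_ids = [9454, 2753] map to ["Yes", "No"].
scores = [[0.00014663469283989953, 6.92653243690209e-05],
          [0.00016726256415974512, 5.4302204151145536e-05],
          [0.0002321074899909774, 5.179018141333686e-05]]
items = ["Scaramento", "San Jose", "San Francisco"]

for item, (p_yes, p_no) in zip(items, scores):
    # Renormalize over just the two label tokens.
    norm = p_yes / (p_yes + p_no)
    print(f"{item}: P(Yes) = {norm:.3f}")
```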

Metrics

Without this PR, sglang:num_running_reqs is always 0.0 for prefill-only requests, because the gauge tracks running_batch's length. Below are example /metrics responses captured during the benchmark period.

Metrics before this PR

# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 0.0
...
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 0.0

Metrics with this PR

# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 30.0

# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 0.0
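For monitoring, the gauge can be pulled out of the Prometheus text-format response shown above. A minimal sketch (in practice a Prometheus client parser would be used instead of a regex):

```python
# Minimal sketch: extract the num_running_reqs gauge value from a
# Prometheus text-format /metrics response like the one above.
import re

metrics_text = """# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 30.0"""

match = re.search(r'sglang:num_running_reqs\{[^}]*\}\s+([\d.]+)', metrics_text)
num_running_reqs = float(match.group(1))
print(num_running_reqs)  # 30.0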

Benchmarking and Profiling

Benchmark env:
GPU: H100.
Model: Qwen3-0.6B.
QPS: 160.
Items per request: 10.
Tokens per query: 120.
Tokens per item: 180.

Running on this PR

Overall Summary for RPS 160, Duration 60s, Item Count 10:
  Test duration:         60 seconds
  Server type:           HTTP
  HTTP mode:             SCORE
  Target RPS:            160
  Item count:            10
  Distribution:          POISSON
  Unique requests generated: 100
  Total requests sent:   9600
  Successful requests:   7162
  Failed requests:       2438
  Overall successful items/sec: 1197.76
  Time to send all requests: 59.73 seconds
  Time for all requests to complete: 59.80 seconds
  Average response time: 120.51 ms
  P50 response time:     118.85 ms
  P90 response time:     161.84 ms
  P95 response time:     175.62 ms
  P99 response time:     255.14 ms

Running on last version

Overall Summary for RPS 160, Duration 60s, Item Count 10:
  Test duration:         60 seconds
  Server type:           HTTP
  HTTP mode:             SCORE
  Target RPS:            160
  Item count:            10
  Distribution:          POISSON
  Unique requests generated: 100
  Total requests sent:   9600
  Successful requests:   6760
  Failed requests:       2840
  Overall successful items/sec: 1133.12
  Time to send all requests: 59.58 seconds
  Time for all requests to complete: 59.66 seconds
  Average response time: 130.39 ms
  P50 response time:     131.42 ms
  P90 response time:     174.63 ms
  P95 response time:     186.60 ms
  P99 response time:     249.31 ms
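Comparing the two benchmark runs above, the PR improves both success rate and throughput. The arithmetic, using the numbers reported in the summaries:

```python
# Comparison of the two benchmark summaries above (same workload:
# RPS 160, 60 s, 10 items/request).
this_pr  = {"success": 7162, "total": 9600, "items_per_sec": 1197.76}
baseline = {"success": 6760, "total": 9600, "items_per_sec": 1133.12}

success_gain = (this_pr["success"] / this_pr["total"]
                - baseline["success"] / baseline["total"])
throughput_gain = this_pr["items_per_sec"] / baseline["items_per_sec"] - 1

print(f"success-rate gain: {success_gain:.1%}")   # +4.2 points
print(f"items/sec gain:    {throughput_gain:.1%}")  # +5.7%
```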

Checklist

@gemini-code-assist

Summary of Changes

Hello @haNa-meister, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue in the generative score API's scheduler where prefill-only requests were not being properly tracked. This led to inaccurate sglang:num_running_reqs metrics and disabled important safety mechanisms. The changes refactor the batch handling and decode skipping logic for prefill-only requests, ensuring correct metric reporting, re-enabling safety checks, and demonstrating improved throughput and success rates in benchmarks.

Highlights

  • Corrected Metric Tracking: Addressed an issue where sglang:num_running_reqs was consistently reporting 0.0 for prefill-only requests, now accurately reflecting the number of active requests.
  • Re-enabled Safety Mechanisms: Fixed a problem where critical safety mechanisms in the scheduler were not being activated due to running_batch being empty for prefill-only requests.
  • Refined Prefill-Only Logic: Moved the logic for skipping decode steps for prefill-only batches into the run decode branch, ensuring these batches are correctly processed without unnecessary decode operations.
  • Improved Batch Filtering: Implemented a mechanism to filter out finished requests from running_batch in each scheduling loop specifically for prefill-only batches, maintaining an accurate state of active requests.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes an issue where prefill-only requests were not being tracked in the running_batch, leading to incorrect metrics and disabled safety mechanisms. The changes are well-reasoned and implemented cleanly. By allowing prefill-only batches to be merged into the running_batch and then explicitly skipping the decode step for them, the core issue is resolved. The addition of a manual filtering step for prefill-only running batches is a necessary and correct adjustment to ensure finished requests are properly cleaned up. The provided benchmarks also indicate a slight performance improvement, which is a great result. The code is clear and the changes are solid.


@sundar24295s sundar24295s left a comment


LGTM! Thanks for fixing this.

@sundar24295s sundar24295s enabled auto-merge (squash) December 17, 2025 23:42
auto-merge was automatically disabled December 19, 2025 21:23

Head branch was pushed to by a user without write access

@hnyls2002 hnyls2002 enabled auto-merge (squash) December 26, 2025 12:10
