
Conversation

@haNa-meister

Motivation

In the last prefill-only change (PR), we decided to skip the decode scheduling stage to achieve higher throughput. However, that implementation also skips merging last_batch into running_batch, so running_batch is always empty.

Metrics

  • Fix the issue where sglang:num_running_reqs is always 0.0.

Safety mechanisms

  • Because running_batch is always empty in this scenario, running_lens is always 0. As a result, many of the scheduler's safety mechanisms are never enabled, for example: link, link
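The bullet above can be illustrated with a minimal sketch. The guard below is hypothetical (the name `check_memory_pressure` and the threshold are illustrative, not SGLang's actual code); it only shows why a scheduler check keyed on the running-batch length can never fire while running_batch stays empty:

```python
# Hypothetical guard modeled on a scheduler safety check that is
# gated on the running-batch length. Names and threshold are
# illustrative assumptions, not SGLang's actual implementation.
def check_memory_pressure(running_batch_reqs, max_running_reqs=64):
    running_lens = len(running_batch_reqs)
    # If running_batch is always empty, running_lens is always 0,
    # so this guard can never trigger even under heavy load.
    if running_lens >= max_running_reqs:
        raise RuntimeError("too many running requests")
    return running_lens

# Before the fix, the scheduler effectively always saw this case:
assert check_memory_pressure([]) == 0
```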

Modifications

  • Move the skip-decode logic for prefill-only requests into the run-decode branch.
  • Filter running_batch in each scheduling loop so finished requests are not tracked indefinitely.
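The two modifications above can be sketched as follows. This is a simplified model of the scheduling loop, not the actual SGLang scheduler: the class names mirror SGLang's (`ScheduleBatch`, `running_batch`) but the bodies are illustrative assumptions.

```python
# Simplified model of the scheduling-loop change described above.
# Not the real SGLang scheduler; names mirror it for readability.

class Req:
    def __init__(self, finished=False, prefill_only=True):
        self.finished = finished
        self.prefill_only = prefill_only

class ScheduleBatch:
    def __init__(self, reqs=None):
        self.reqs = list(reqs or [])
    def is_empty(self):
        return not self.reqs
    def merge(self, other):
        self.reqs.extend(other.reqs)
    def filter_finished(self):
        # Drop finished requests each loop so the batch stays accurate.
        self.reqs = [r for r in self.reqs if not r.finished]

def scheduling_step(running_batch, last_batch):
    # Before this PR, a prefill-only last_batch was never merged,
    # so running_batch stayed empty forever.
    if last_batch is not None:
        running_batch.merge(last_batch)

    # New: filter finished requests on every scheduling loop.
    running_batch.filter_finished()

    num_running_reqs = len(running_batch.reqs)  # the gauge's source

    # The skip-decode decision now lives inside the decode branch:
    if not running_batch.is_empty():
        if all(r.prefill_only for r in running_batch.reqs):
            pass  # prefill-only: skip the decode forward pass
        else:
            pass  # run_decode(running_batch) would go here
    return num_running_reqs
```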

Accuracy Tests

Test env:
GPU: H100.
Model: Qwen3-0.6B.

Running on this PR

curl -X POST "http://localhost:8080/v1/score"   -H "Content-Type: application/json"   -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753]
  }'
{"scores":[[0.00014663469283989953,6.92653243690209e-05],[0.00016726256415974512,5.4302204151145536e-05],[0.0002321074899909774,5.179018141333686e-05]], ...

Running on last version

curl -X POST "http://localhost:8080/v1/score"   -H "Content-Type: application/json"   -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753]
  }'
{"scores":[[0.00014663469283989953,6.92653243690209e-05],[0.00016726256415974512,5.4302204151145536e-05],[0.0002321074899909774,5.179018141333686e-05]],"model":"...,"usage":null,"object":"scoring"}
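The identical score arrays above confirm the fix does not change results. As a side note, the raw per-item probabilities can be normalized into a Yes-vs-No preference. This assumes the two `label_token_ids` correspond to "Yes" and "No" tokens in Qwen3's tokenizer, in that order, which is an assumption, not something stated in this PR:

```python
# Post-processing sketch for the /v1/score response shown above.
# Assumption: label_token_ids = [9454, 2753] map to ["Yes", "No"].
scores = [[0.00014663469283989953, 6.92653243690209e-05],
          [0.00016726256415974512, 5.4302204151145536e-05],
          [0.0002321074899909774, 5.179018141333686e-05]]
items = ["Scaramento", "San Jose", "San Francisco"]

for item, (p_yes, p_no) in zip(items, scores):
    # Renormalize over just the two label tokens.
    norm = p_yes / (p_yes + p_no)
    print(f"{item}: P(Yes) = {norm:.3f}")
```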

Metrics

Without this PR, sglang:num_running_reqs is always 0.0 for prefill-only requests, because the gauge tracks running_batch's length. Below are example /metrics responses captured during the benchmark period.

Metrics before this PR

# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 0.0
...
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 0.0

Metrics with this PR

# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 30.0

# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 0.0
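For monitoring, the gauge can be pulled out of the Prometheus text-format response shown above. A minimal sketch (in practice a Prometheus client parser would be used instead of a regex):

```python
# Minimal sketch: extract the num_running_reqs gauge value from a
# Prometheus text-format /metrics response like the one above.
import re

metrics_text = """# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 30.0"""

match = re.search(r'sglang:num_running_reqs\{[^}]*\}\s+([\d.]+)', metrics_text)
num_running_reqs = float(match.group(1))
print(num_running_reqs)  # 30.0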

Benchmarking and Profiling

Benchmark env:
GPU: H100.
Model: Qwen3-0.6B.
QPS: 160.
Items per request: 10.
Tokens per query: 120.
Tokens per item: 180.

Running on this PR

Overall Summary for RPS 160, Duration 60s, Item Count 10:
  Test duration:         60 seconds
  Server type:           HTTP
  HTTP mode:             SCORE
  Target RPS:            160
  Item count:            10
  Distribution:          POISSON
  Unique requests generated: 100
  Total requests sent:   9600
  Successful requests:   7162
  Failed requests:       2438
  Overall successful items/sec: 1197.76
  Time to send all requests: 59.73 seconds
  Time for all requests to complete: 59.80 seconds
  Average response time: 120.51 ms
  P50 response time:     118.85 ms
  P90 response time:     161.84 ms
  P95 response time:     175.62 ms
  P99 response time:     255.14 ms

Running on last version

Overall Summary for RPS 160, Duration 60s, Item Count 10:
  Test duration:         60 seconds
  Server type:           HTTP
  HTTP mode:             SCORE
  Target RPS:            160
  Item count:            10
  Distribution:          POISSON
  Unique requests generated: 100
  Total requests sent:   9600
  Successful requests:   6760
  Failed requests:       2840
  Overall successful items/sec: 1133.12
  Time to send all requests: 59.58 seconds
  Time for all requests to complete: 59.66 seconds
  Average response time: 130.39 ms
  P50 response time:     131.42 ms
  P90 response time:     174.63 ms
  P95 response time:     186.60 ms
  P99 response time:     249.31 ms
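Comparing the two benchmark runs above, the PR improves both success rate and throughput. The arithmetic, using the numbers reported in the summaries:

```python
# Comparison of the two benchmark summaries above (same workload:
# RPS 160, 60 s, 10 items/request).
this_pr  = {"success": 7162, "total": 9600, "items_per_sec": 1197.76}
baseline = {"success": 6760, "total": 9600, "items_per_sec": 1133.12}

success_gain = (this_pr["success"] / this_pr["total"]
                - baseline["success"] / baseline["total"])
throughput_gain = this_pr["items_per_sec"] / baseline["items_per_sec"] - 1

print(f"success-rate gain: {success_gain:.1%}")   # +4.2 points
print(f"items/sec gain:    {throughput_gain:.1%}")  # +5.7%
```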

Checklist

@gemini-code-assist

Summary of Changes

Hello @haNa-meister, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue in the generative score API's scheduler where prefill-only requests were not being properly tracked. This led to inaccurate sglang:num_running_reqs metrics and disabled important safety mechanisms. The changes refactor the batch handling and decode skipping logic for prefill-only requests, ensuring correct metric reporting, re-enabling safety checks, and demonstrating improved throughput and success rates in benchmarks.

Highlights

  • Corrected Metric Tracking: Addressed an issue where sglang:num_running_reqs was consistently reporting 0.0 for prefill-only requests, now accurately reflecting the number of active requests.
  • Re-enabled Safety Mechanisms: Fixed a problem where critical safety mechanisms in the scheduler were not being activated due to running_batch being empty for prefill-only requests.
  • Refined Prefill-Only Logic: Moved the logic for skipping decode steps for prefill-only batches into the run decode branch, ensuring these batches are correctly processed without unnecessary decode operations.
  • Improved Batch Filtering: Implemented a mechanism to filter out finished requests from running_batch in each scheduling loop specifically for prefill-only batches, maintaining an accurate state of active requests.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes an issue where prefill-only requests were not being tracked in the running_batch, leading to incorrect metrics and disabled safety mechanisms. The changes are well-reasoned and implemented cleanly. By allowing prefill-only batches to be merged into the running_batch and then explicitly skipping the decode step for them, the core issue is resolved. The addition of a manual filtering step for prefill-only running batches is a necessary and correct adjustment to ensure finished requests are properly cleaned up. The provided benchmarks also indicate a slight performance improvement, which is a great result. The code is clear and the changes are solid.


@sundar24295s sundar24295s left a comment


LGTM! Thanks for fixing this.

@sundar24295s sundar24295s enabled auto-merge (squash) December 17, 2025 23:42
auto-merge was automatically disabled December 19, 2025 21:23

Head branch was pushed to by a user without write access

@hnyls2002 hnyls2002 enabled auto-merge (squash) December 26, 2025 12:10
