[WIP] new scheduler policy based on tkv shift ratio #957

Draft
yannicks1 wants to merge 5 commits into torch-spyre:main from yannicks1:rel-tkv-shift-aware-scheduler

Conversation

@yannicks1
Collaborator

Description

Currently the chunked-prefill scheduler admits waiting requests in strict FIFO order, which shifts the decode-batch tkv upward with every new admission. This hurts decoders already in the batch: a long prompt joining behind short decoders drags them onto a larger compiled program and inflates their ITL for the rest of their lifetime.

This PR adds a soft admission gate that skips a candidate whose admission would push the (block-aligned) decode tkv up by more than a configurable ratio, giving shorter requests further down the queue a chance to join instead. A per-request skip counter prevents the long request from starving. The two knobs are listed below, followed by a sketch of how they interact.

  • SENDNN_INFERENCE_MAX_TKV_SHIFT_RATIO (default 1.5): skip a candidate if new_tkv / current_tkv exceeds this ratio. Set to inf to disable and restore FIFO.
  • SENDNN_INFERENCE_MAX_SKIP_COUNT (default 4): force-admit a request after it has been skipped this many times (anti-starvation).
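
A minimal sketch of the gate, assuming illustrative names (`can_admit`, `tkv_if_admitted`, and `skip_count` are placeholders for this explanation, not the PR's actual identifiers):

```python
import os

# Illustrative sketch only; the PR's real scheduler code may differ.
MAX_TKV_SHIFT_RATIO = float(os.getenv("SENDNN_INFERENCE_MAX_TKV_SHIFT_RATIO", "1.5"))
MAX_SKIP_COUNT = int(os.getenv("SENDNN_INFERENCE_MAX_SKIP_COUNT", "4"))

def can_admit(tkv_if_admitted: int, current_tkv: int, skip_count: int) -> bool:
    """Soft gate: skip a candidate that would shift the decode tkv too far."""
    if skip_count >= MAX_SKIP_COUNT:
        return True  # anti-starvation: force-admit after too many skips
    if current_tkv == 0:
        return True  # empty decode batch: nothing to protect
    # Ratio check; setting the env var to "inf" makes this always pass,
    # which restores plain FIFO admission.
    return tkv_if_admitted / current_tkv <= MAX_TKV_SHIFT_RATIO
```

When the gate rejects a candidate, the scheduler increments its skip counter and considers the next request in the queue instead.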

Related Issues

#746

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)

…chanism

Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
@github-actions

github-actions Bot commented May 4, 2026

👋 Hi! Thank you for contributing.
Just a reminder: make sure your code passes all the linting checks, otherwise your PR cannot be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
@yannicks1
Collaborator Author

bot:bench
NUM_PROMPTS=1000
MAX_RUN_TIME=36000
IGNORE_EOS=1
CUSTOM_OUTPUT_LEN=-1
MAX_CONCURRENT=4

yannicks1 added 3 commits May 4, 2026 23:00
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
@sducouedic
Collaborator

sducouedic commented May 8, 2026

@yannicks1 @tdoublep when reading about this I thought about something. Maybe instead of using a hardcoded ratio limit (SENDNN_INFERENCE_MAX_TKV_SHIFT_RATIO) which the next request should satisfy (binary yes/no), we can think of it as an optimization problem. Here is the reasoning:

  • why do we want this ratio? it is to try grouping short requests together, and long requests together. Basically we want more homogeneity in the decode batch
  • the underlying reason for that is that long requests force short ones to left-pad with full blocks. Padding represents simply and purely a waste of compute, we want to reduce dummy blocks padding.

The "optimization problem" part:

  • for each request in the queue we associate a cost that we want to minimize: the number of total number of padding blocks in the batch if we schedule the request.
  • now our problem becomes a more classical cost-fairness scheduling problem that we see in OS, and we can use something like picking the request that minimize the cost: cost = total_padding_blocks_if_scheduled + w * waiting_time (or replace waiting time by skip count)

the only problem I see here (that might actually make this idea a no-go) is that every time we consider scheduling a new request, we need to compute the total num_paddings for all the requests in the queue, and we don't want this optimization logic to become the new decode bottleneck
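
A sketch of the cost-based pick described above, under stated assumptions (the `Candidate` fields and `WAIT_WEIGHT` are hypothetical; how to compute the padding term cheaply is exactly the open question raised above):

```python
from dataclasses import dataclass
from typing import Optional, Sequence

WAIT_WEIGHT = 0.1  # illustrative weight w; tuning it is an open question

@dataclass
class Candidate:
    # Hypothetical fields for illustration; not this repo's actual types.
    padding_blocks_if_scheduled: int  # batch-wide padding blocks if admitted
    waiting_time: float               # seconds spent waiting in the queue

def pick_next(queue: Sequence[Candidate]) -> Optional[Candidate]:
    # Minimize padding cost, discounted by waiting time so that a
    # long-waiting request eventually wins even if it pads the batch.
    return min(
        queue,
        key=lambda c: c.padding_blocks_if_scheduled - WAIT_WEIGHT * c.waiting_time,
        default=None,
    )
```

The caveat above still holds: `padding_blocks_if_scheduled` would be evaluated for every queued request on every scheduling step, so it would need to be cheap or cached to avoid becoming the new bottleneck.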
