[WIP] new scheduler policy based on tkv shift ratio #957

Draft
yannicks1 wants to merge 5 commits into torch-spyre:main from yannicks1:rel-tkv-shift-aware-scheduler

Conversation

@yannicks1
Collaborator

Description

Currently the chunked-prefill scheduler admits waiting requests in strict FIFO order, which shifts the decode-batch tkv upward with every new admission. This hurts decoders already in the batch: a long prompt joining behind short decoders drags them onto a larger compiled program and inflates their ITL for the rest of their lifetime.

This PR adds a soft admission gate that skips a candidate whose admission would push the (block-aligned) decode tkv up by more than a configurable ratio, giving shorter requests further down the queue a chance to join instead. A per-request skip counter prevents the long request from starving. The two knobs are listed below, followed by a sketch of how they interact.

  • SENDNN_INFERENCE_MAX_TKV_SHIFT_RATIO (default 1.5): skip a candidate if new_tkv / current_tkv exceeds this ratio. Set to inf to disable and restore FIFO.
  • SENDNN_INFERENCE_MAX_SKIP_COUNT (default 4): force-admit a request after it has been skipped this many times (anti-starvation).
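
A minimal sketch of the gate, assuming illustrative names (`can_admit`, `tkv_if_admitted`, and `skip_count` are placeholders for this explanation, not the PR's actual identifiers):

```python
import os

# Illustrative sketch only; the PR's real scheduler code may differ.
MAX_TKV_SHIFT_RATIO = float(os.getenv("SENDNN_INFERENCE_MAX_TKV_SHIFT_RATIO", "1.5"))
MAX_SKIP_COUNT = int(os.getenv("SENDNN_INFERENCE_MAX_SKIP_COUNT", "4"))

def can_admit(tkv_if_admitted: int, current_tkv: int, skip_count: int) -> bool:
    """Soft gate: skip a candidate that would shift the decode tkv too far."""
    if skip_count >= MAX_SKIP_COUNT:
        return True  # anti-starvation: force-admit after too many skips
    if current_tkv == 0:
        return True  # empty decode batch: nothing to protect
    # Ratio check; setting the env var to "inf" makes this always pass,
    # which restores plain FIFO admission.
    return tkv_if_admitted / current_tkv <= MAX_TKV_SHIFT_RATIO
```

When the gate rejects a candidate, the scheduler increments its skip counter and considers the next request in the queue instead.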

Related Issues

#746

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)

…chanism

Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
@github-actions

github-actions Bot commented May 4, 2026

👋 Hi! Thank you for contributing.
Just a reminder: make sure your code passes all the linting checks, otherwise your PR cannot be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
@yannicks1
Collaborator Author

bot:bench
NUM_PROMPTS=1000
MAX_RUN_TIME=36000
IGNORE_EOS=1
CUSTOM_OUTPUT_LEN=-1
MAX_CONCURRENT=4

yannicks1 added 3 commits May 4, 2026 23:00
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
@sducouedic
Collaborator

sducouedic commented May 8, 2026

@yannicks1 @tdoublep when reading about this I thought about something. Maybe instead of using a hardcoded ratio limit (SENDNN_INFERENCE_MAX_TKV_SHIFT_RATIO) which the next request should satisfy (binary yes/no), we can think of it as an optimization problem. Here is the reasoning:

  • why do we want this ratio? it is to try grouping short requests together, and long requests together. Basically we want more homogeneity in the decode batch
  • the underlying reason for that is that long requests force short ones to left-pad with full blocks. Padding represents simply and purely a waste of compute, we want to reduce dummy blocks padding.

The "optimization problem" part:

  • for each request in the queue we associate a cost that we want to minimize: the number of total number of padding blocks in the batch if we schedule the request.
  • now our problem becomes a more classical cost-fairness scheduling problem that we see in OS, and we can use something like picking the request that minimize the cost: cost = total_padding_blocks_if_scheduled + w * waiting_time (or replace waiting time by skip count)

the only problem I see here (that might actually make this idea a no-go) is that every time we consider scheduling a new request, we need to compute the total num_paddings for all the requests in the queue, and we don't want this optimization logic to become the new decode bottleneck
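
A sketch of the cost-based pick described above, under stated assumptions (the `Candidate` fields and `WAIT_WEIGHT` are hypothetical; how to compute the padding term cheaply is exactly the open question raised above):

```python
from dataclasses import dataclass
from typing import Optional, Sequence

WAIT_WEIGHT = 0.1  # illustrative weight w; tuning it is an open question

@dataclass
class Candidate:
    # Hypothetical fields for illustration; not this repo's actual types.
    padding_blocks_if_scheduled: int  # batch-wide padding blocks if admitted
    waiting_time: float               # seconds spent waiting in the queue

def pick_next(queue: Sequence[Candidate]) -> Optional[Candidate]:
    # Minimize padding cost, discounted by waiting time so that a
    # long-waiting request eventually wins even if it pads the batch.
    return min(
        queue,
        key=lambda c: c.padding_blocks_if_scheduled - WAIT_WEIGHT * c.waiting_time,
        default=None,
    )
```

The caveat above still holds: `padding_blocks_if_scheduled` would be evaluated for every queued request on every scheduling step, so it would need to be cheap or cached to avoid becoming the new bottleneck.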
