[WIP] new scheduler policy based on tkv shift ratio #957
…chanism Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
👋 Hi! Thank you for contributing. We also recommend installing prek and configuring it to check your code before every local commit.
bot:bench
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
@yannicks1 @tdoublep when reading about this I thought about something. Maybe instead of using a hardcoded ratio limit (
The "optimization problem" part:
The only problem I see here (one that might actually make this idea a no-go) is that every time we consider scheduling a new request, we need to compute the total num_paddings for all the requests in the queue, and we don't want this optimization logic to become the new decode bottleneck.
Description
Currently the chunked-prefill scheduler admits waiting requests strictly FIFO, which shifts the decode-batch tkv upward for every new admission. This hurts decoders already in the batch: a long prompt joining behind short decoders drags them onto a larger compiled program and inflates their ITL for the rest of their lifetime.
This PR adds a soft admission gate that skips a candidate whose admission would push the (block-aligned) decode tkv by more than a configurable ratio, giving shorter requests further down the queue a chance to join instead. A per-request skip counter prevents the long request from starving.
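The gate described above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the `Request` class, `block_align`, `load_limits`, `pick_next`, and the block size of 64 are all hypothetical, and the sketch assumes that admitting a request raises the decode tkv to at least its block-aligned prompt length. Only the two environment variable names and their defaults come from the PR description.

```python
import math
import os
from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE = 64  # hypothetical block size, chosen only for illustration


@dataclass
class Request:
    req_id: str
    prompt_len: int
    skip_count: int = 0  # how many times the gate has passed over this request


def block_align(n: int, block_size: int = BLOCK_SIZE) -> int:
    """Round n up to the next multiple of block_size."""
    return math.ceil(n / block_size) * block_size


def load_limits() -> tuple[float, int]:
    """Read both knobs from the environment, falling back to the PR's defaults."""
    ratio = float(os.environ.get("SENDNN_INFERENCE_MAX_TKV_SHIFT_RATIO", "1.5"))
    skips = int(os.environ.get("SENDNN_INFERENCE_MAX_SKIP_COUNT", "4"))
    return ratio, skips


def pick_next(
    waiting: list[Request], current_tkv: int, max_ratio: float, max_skip: int
) -> Optional[Request]:
    """Return the first request in queue order that passes the gate, or None."""
    for req in waiting:
        # Assumption: the candidate lifts the batch tkv to its aligned prompt length.
        new_tkv = max(current_tkv, block_align(req.prompt_len))
        # An empty decode batch (tkv of 0) has nothing to protect, so admit.
        within_ratio = current_tkv == 0 or new_tkv / current_tkv <= max_ratio
        # Anti-starvation: a request skipped max_skip times is force-admitted.
        if within_ratio or req.skip_count >= max_skip:
            return req
        req.skip_count += 1
    return None
```

With a current tkv of 128 and a queue holding a 1000-token prompt followed by a 100-token prompt, this sketch skips the long request (1024 / 128 exceeds 1.5), increments its skip counter, and admits the short one. Passing `float("inf")` as `max_ratio` makes every candidate pass, which matches the plain FIFO behaviour.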
- SENDNN_INFERENCE_MAX_TKV_SHIFT_RATIO (default 1.5): skip a candidate if new_tkv / current_tkv exceeds this ratio. Set to inf to disable and restore FIFO.
- SENDNN_INFERENCE_MAX_SKIP_COUNT (default 4): force-admit a request after it has been skipped this many times (anti-starvation).

Related Issues
#746
Checklist
bash format.sh)
Signed-off-by: line (DCO compliance)