Introduce async scheduler implementation with mixin pattern #941
GOavi101 wants to merge 6 commits
Conversation
👋 Hi! Thank you for contributing. We also recommend installing prek and configuring it to check your code before every local commit.
# The mixin's pre-filter pattern is not safe under that run-ahead scenario.
# For TP=1 (UniProcExecutor), futures are immediately done so it's safe.
if parallel_config.world_size > 1:
    scheduler_config.async_scheduling = False
Interesting- if we wanted to support this feature then it would likely need to work with TP=4 which is how we run most models. I thought this was only incompatible with pipeline parallel upstream - does it also not work with tensor parallel?
The fix is SpyreMultiprocExecutor — a thin MultiprocExecutor subclass that overrides max_concurrent_batches to return 1 instead of 2. This forces the engine to use the simpler step() path (strictly schedule → execute → update) rather than step_with_batch_queue, which was the only thing that broke TP>1.
Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose. The AsyncScheduler base class and its _update_after_schedule TTFT benefit are still fully active — we just removed the run-ahead that its state tracking couldn't handle.
So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.
what do you think?
That doesn't quite line up with my understanding- IIUC the step_with_batch_queue method is what works with the speculative scheduling: The engine runs the scheduler again while the model is running, assuming that the requests in the batch will continue.
> Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose
I don't quite understand this either- the multiproc executor is definitely async, it broadcasts an RPC to the workers to run the model and the engine gets back a future that it waits on. step_with_batch_queue queues up that future so that it can speculatively schedule the next pass.
This TP=1 profile shows the scheduler running in between the model forward passes, the goal with async scheduling is to get the scheduler running for the next step during the model forward pass instead:
> The AsyncScheduler base class and its _update_after_schedule TTFT benefit are still fully active — we just removed the run-ahead that its state tracking couldn't handle.
> So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.
Based on the above, my understanding is that the run-ahead state is the whole point and we won't gain any performance benefit from this unless we support it, so this is a blocker. Is there something else I'm missing?
You're right, thanks for the correction. I'll fix this — snapshot the mixin's mutable state (ongoing_prefills, tkv, previous_step_was_prefill) before delegating to super().schedule() so the run-ahead second schedule() call sees consistent state, and remove SpyreMultiprocExecutor. That way TP≥2 gets the full async scheduling benefit.
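The snapshot idea can be sketched roughly as below. `ongoing_prefills`, `tkv`, and `previous_step_was_prefill` are the fields named in the discussion; the class and method names here are illustrative, not the real scheduler:

```python
class SnapshotSketch:
    """Illustrative holder for the mixin's mutable scheduling state."""

    def __init__(self) -> None:
        self.ongoing_prefills: set[str] = set()
        self.tkv = 0
        self.previous_step_was_prefill = False

    def _snapshot(self):
        # Copy mutable state so a run-ahead second schedule() call can be
        # evaluated against a consistent view of the previous step.
        return (set(self.ongoing_prefills), self.tkv,
                self.previous_step_was_prefill)

    def _restore(self, snap) -> None:
        (self.ongoing_prefills, self.tkv,
         self.previous_step_was_prefill) = snap
```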
Yeah, the step_with_batch_queue is actually what makes async scheduling work at all. In the PR where Woosuk added this, he reused the existing batch queue that was originally intended for Pipeline Parallel (PP). With PP you have to wait N steps for each of the N PP stages. But with async scheduling the intention is for the scheduler to be ahead by 1 step, so the num_output_placeholders were introduced to prevent the scheduler from waiting for the current step.
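The run-ahead-by-one shape described above can be shown with a simplified analogue. The real engine method is vLLM's `step_with_batch_queue`; the function below is illustrative only. With a queue depth of 2, the scheduler stays one step ahead of the committed outputs, so the update for step i happens after step i+1 was already scheduled:

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def run_steps(schedule, execute, update, num_iterations, queue_depth=2):
    """Toy run-ahead loop: schedule ahead while execution is in flight."""
    pool = ThreadPoolExecutor(max_workers=1)
    queue: deque = deque()
    trace = []
    for _ in range(num_iterations):
        if len(queue) < queue_depth:
            batch = schedule()                         # schedule ahead
            trace.append(f"sched({batch})")
            queue.append(pool.submit(execute, batch))  # async execution
        else:
            out = queue.popleft().result()             # wait for oldest step
            trace.append(f"update({out})")
            update(out)
    while queue:                                       # drain remaining steps
        out = queue.popleft().result()
        trace.append(f"update({out})")
        update(out)
    pool.shutdown()
    return trace
```

Note the interleaving: the second step is scheduled before the first step's output is committed, which is exactly why the scheduler cannot see the EOS token of the in-flight step.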
Thanks @GOavi101! A few notes:
maxdebayser left a comment
Thanks for taking on this issue, @GOavi101. I think perhaps it would be better to take a step back and re-think this PR a bit.
There are a few unnecessary changes in this PR. I would suggest the following:
- Remove the `AsyncPoolingSpyreScheduler` class and the code that selects it in `platform.py`, as async scheduling is not advantageous for pooling
- Undo the mixin class structure. Currently its only purpose is for `_is_async_scheduler()` to run `isinstance(self, AsyncScheduler)`, but the same can be achieved by returning `vllm_config.scheduler_config.async_scheduling`
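The second suggestion amounts to deriving "is async" from the engine config instead of inspecting the class hierarchy. A minimal sketch, with `SchedulerConfig` and `VllmConfig` as stand-ins for the real vLLM config objects:

```python
from dataclasses import dataclass


@dataclass
class SchedulerConfig:
    """Stand-in for vLLM's scheduler config."""
    async_scheduling: bool = False


@dataclass
class VllmConfig:
    """Stand-in for vLLM's top-level config."""
    scheduler_config: SchedulerConfig


def is_async_scheduler(vllm_config: VllmConfig) -> bool:
    # No isinstance(self, AsyncScheduler) check needed: the config flag
    # already records whether async scheduling was requested.
    return vllm_config.scheduler_config.async_scheduling
```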
In PR 19970 Woosuk added async scheduling based on this idea from the NanoFlow paper:
Asynchronous scheduling: Batch formation, including estimating memory
usage, scheduling new requests, retiring finished requests, and adjusting the
page table for PagedAttention [17], consumes a non-negligible amount of time
on the CPU side [42]. In most serving frameworks [17,58], only after the GPU
executes one iteration, the scheduler on the CPU is able to detect EOS tokens,
remove the finished request from the batch, and refill the batch with new
requests. However, GPU is under-utilized during this time. To avoid this waste,
NanoFlow asynchronously schedules batch formation in parallel to the GPU
executions. At any iteration i, NanoFlow forms the batch for the next iteration
before the end of the current iteration. This means that NanoFlow cannot
detect the EOS tokens generated at iteration i. After launching iteration i+1,
NanoFlow forms the batch for cycle i+2, detects the EOS token from iteration i,
and removes finished requests. Fortunately, since the average decode length
surpasses 100 for typical workloads (See Table 4), the overhead of one extra
decode token is negligible (< 1%), given the benefit of hiding the batch
formation overhead.
In this first PR, no model runner changes were needed. Later additions were made to overlap the model runner execution with other steps such as sampling and the WorkerBase sample_tokens method was added to separate sampling from the execution of the model.
I think that the main problem to solve is that in Spyre we can't run prefills in the same batch as the decode requests. If it weren't for this fact, probably the only required code change would be implementing the sample_tokens() method in the SpyreWorker.
Since we can currently interleave prefill chunk batches and decode batches, in principle there is no problem in doing so asynchronously.
But once prefill is done, we either must wait and add the request in the current decode batch or in the one after that.
# not yet committed by ``update_from_output``. The committed prefill
# position for ``req`` is
# ``req.num_computed_tokens - self._inflight_prefill_tokens.get(rid, 0)``.
self._inflight_prefill_tokens: dict[str, int] = {}
Why is this necessary? Upstream the scheduler skips the chunks:
if request.is_prefill_chunk:
continue
The is_prefill_chunk skip in upstream's _update_after_schedule only avoids bumping num_output_placeholders (which is decode-only). The base Scheduler._update_after_schedule still optimistically advances request.num_computed_tokens by num_scheduled_tokens for prefill chunks — that's what enables schedule(N+1) to pick the next chunk while execute(N) is in flight.
We need _inflight_prefill_tokens because on Spyre the committed prefill position after the runner returns is not necessarily equal to the optimistically-scheduled amount: the model runner adjusts for left-padding and prefix-cache hits and reports back what was actually consumed. Without tracking the per-chunk optimistic delta, update_from_output cannot reconcile num_computed_tokens to the runner's actual report, and (under run-ahead) schedule() cannot tell which portion of num_computed_tokens is committed vs. in-flight when deciding whether a request is still in ongoing_prefills.
In sync mode this is a no-op (added and cleared in the same step). It only matters when the async run-ahead inserts a schedule() call between the optimistic advance and the runner's commit.
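The bookkeeping described above can be sketched as follows. Field and method names mirror the discussion (`_inflight_prefill_tokens`, optimistic `num_computed_tokens`); the class itself is illustrative, not the real scheduler:

```python
class InflightPrefillTracker:
    """Toy model of the optimistic-advance / runner-commit reconciliation."""

    def __init__(self) -> None:
        self._inflight_prefill_tokens: dict[str, int] = {}
        self.num_computed_tokens: dict[str, int] = {}

    def on_schedule(self, req_id: str, num_scheduled: int) -> None:
        # Optimistic advance in _update_after_schedule: record both the new
        # position and the not-yet-committed delta.
        self.num_computed_tokens[req_id] = (
            self.num_computed_tokens.get(req_id, 0) + num_scheduled)
        self._inflight_prefill_tokens[req_id] = (
            self._inflight_prefill_tokens.get(req_id, 0) + num_scheduled)

    def committed_position(self, req_id: str) -> int:
        # What a run-ahead schedule() may treat as actually done.
        return (self.num_computed_tokens.get(req_id, 0)
                - self._inflight_prefill_tokens.get(req_id, 0))

    def on_runner_commit(self, req_id: str, actual_consumed: int) -> None:
        # Reconcile in update_from_output: the runner may report fewer (or
        # different) consumed tokens than were optimistically scheduled.
        optimistic = self._inflight_prefill_tokens.pop(req_id, 0)
        self.num_computed_tokens[req_id] += actual_consumed - optimistic
```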
… collapse mixins

- Remove AsyncPoolingSpyreScheduler; pooling models always use the sync PoolingSpyreScheduler (async scheduling is not advantageous for pooling).
- Collapse PoolingSpyreMixin / ChunkedPrefillSpyreMixin into direct Scheduler subclasses. AsyncChunkedPrefillSpyreScheduler subclasses (ChunkedPrefillSpyreScheduler, AsyncScheduler).
- Replace isinstance-based _is_async_scheduler() with a check on vllm_config.scheduler_config.async_scheduling.
- Revert speculative model_runner / worker changes that should land in a follow-up that implements SpyreWorker.sample_tokens().
- Update tests accordingly.

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
…l runner

The engine's async scheduling path uses the sample_tokens future result as the model_output passed to scheduler.update_from_output, while the execute_model future result is only checked for None. The previous implementation folded sampling into execute_model and returned EMPTY_MODEL_RUNNER_OUTPUT from sample_tokens, causing real outputs to be silently dropped under async scheduling and triggering a scheduler/runner desync.

Split the chunked-prefill runner so execute_model only runs the forward pass and stashes the logits, then sample_tokens consumes the stash, applies the grammar bitmask, runs sampling, and returns the real ModelRunnerOutput. apply_grammar_bitmask now takes grammar_output as a parameter (the previous getattr from scheduler_output never matched anything). A new execute_and_sample helper preserves the combined behaviour for tests and warmup paths that drive the runner directly. The base class gets a default sample_tokens that returns EMPTY_MODEL_RUNNER_OUTPUT for subclasses (e.g. pooling) that fold sampling into the forward pass.

Warmup paths in SpyreWorker invoke execute_model directly; after the split they would leak _pending_sample state. Introduce SpyreWorker._execute_and_sample to drain the deferred sampling step and route warmup callsites through it (cleanup paths, which schedule no tokens, keep using execute_model and return a concrete empty output).

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
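The execute/sample split in this commit can be sketched as below. The method names (`execute_model`, `sample_tokens`, `execute_and_sample`, `_pending_sample`) come from the commit message; the bodies here are placeholders, not the real runner:

```python
class SplitRunnerSketch:
    """Toy runner showing the deferred-sampling contract."""

    def __init__(self) -> None:
        self._pending_sample = None

    def execute_model(self, scheduler_output):
        # Forward pass only; stash what sample_tokens will need.
        logits = f"logits({scheduler_output})"  # placeholder for a tensor
        self._pending_sample = logits
        # Under async scheduling the engine only checks this result for None.
        return None

    def sample_tokens(self):
        # Drain the stash, run sampling, and return the real output.
        logits, self._pending_sample = self._pending_sample, None
        return f"sampled({logits})"

    def execute_and_sample(self, scheduler_output):
        # Combined path for warmup/tests that drive the runner directly,
        # guaranteeing no _pending_sample state leaks across steps.
        self.execute_model(scheduler_output)
        return self.sample_tokens()
```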
…er-mixin-pattern

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>

# Conflicts:
#	sendnn_inference/v1/worker/spyre_worker.py
- Reconcile optimistic _inflight_prefill_tokens when the runner reports empty req_ids (async incomplete prefill); support generic ModelRunnerOutput via duck-typed padding fields.
- Worker: split execute_model/sample_tokens; execute_and_sample helper; dense req_id_to_index for sampled_token_ids; reconcile input_batch for partial decode schedules.
- Scheduler TKV tests use execute_and_sample; conftest tolerates read-only cwd for test-sort artifact.
- Omit temporary SDBG instrumentation.

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
- Add _discard_stale_pending_sample for empty schedules and incomplete-prefill early returns so deferred sampling cannot leak across steps.
- Route warmup _cleanup_model_runner through _execute_and_sample for a single contract with ChunkedPrefillModelRunner.
- Resolve grammar for masking via explicit branches (engine arg vs Spyre _spyre_grammar_output on SchedulerOutput).
- Document dense req_id_to_index for sampled_token_ids vs vLLM scheduler indexing.
Async Scheduling for Spyre Generative Models
Wires Spyre’s chunked-prefill scheduler to vLLM’s upstream AsyncScheduler so async scheduling (run-ahead / batch queue as implemented by the pinned vLLM v1 engine) can be used for generative models. Spyre-specific reconcile and worker changes keep scheduling correct under optimistic num_computed_tokens and related state.
Background
Upstream async scheduling lets the engine overlap scheduling with execution: the scheduler can advance before the previous step’s outputs are fully committed (run-ahead). In v1 this is tied to the engine’s async / batch-queue execution path (see comments in scheduler.py referencing step_with_batch_queue — exact wiring lives in vLLM, not reimplemented here).
Spyre’s ChunkedPrefillSpyreScheduler keeps extra mutable state (ongoing_prefills, TKV / volumetric admission, previous_step_was_prefill, …) that vanilla Scheduler does not own. Base _update_after_schedule optimistically bumps num_computed_tokens for scheduled prefill chunks; the Spyre runner then commits actual progress (left-padding, prefix-cache hits can differ). Without reconciliation, a speculative schedule(N+1) could read wrong committed vs in-flight positions and make bad admission decisions.
What changed
- `sendnn_inference/v1/core/async_scheduler.py` (new file): `AsyncChunkedPrefillSpyreScheduler` is a thin subclass combining `ChunkedPrefillSpyreScheduler` and the upstream `AsyncScheduler` via multiple inheritance. Python's MRO ensures `super().schedule()` hits `AsyncScheduler` before the base `Scheduler`. No pooling async variant is created — async scheduling has no benefit for pooling models.
- `sendnn_inference/v1/core/scheduler.py`: `_inflight_prefill_tokens: dict[str, int]` tracks the optimistic `num_computed_tokens` delta per request between `_update_after_schedule` and the runner's actual commit. In sync mode this is a no-op (added and cleared in the same step). Under run-ahead it lets `update_from_output` reconcile to the runner's real report, and lets the speculative `schedule()` call distinguish committed from in-flight tokens when evaluating `ongoing_prefills` and TKV admission. `update_from_output` reconciles `num_computed_tokens` using the inflight delta. `finish_requests` pops specific request IDs (or clears on finish-all) so aborted or mid-prefill-finished requests don't leave stale entries. `assert len(_inflight_prefill_tokens) <= max_num_seqs`.
- `sendnn_inference/v1/worker/spyre_worker.py`: The async engine calls `execute_model` and `sample_tokens` as separate steps. `execute_model` defers its result to `_pending_sample`; `sample_tokens` drains it. Warmup and cleanup paths that must not leave `_pending_sample` populated use `_execute_and_sample` instead of calling `execute_model` directly. Grammar masking falls back gracefully when `_spyre_grammar_output` is absent.
- `sendnn_inference/platform.py`: `get_spyre_scheduler_cls` centralises class selection, validates non-string `scheduler_cls` values as `Scheduler` subclasses, and logs the selected class name at startup so pooling cases cannot misleadingly appear as async-enabled.
- `tests/v1/core/test_scheduler.py`: New and extended tests: sync/async schedule-update cycles, `_inflight_prefill_tokens` correctness under normal flow and mid-prefill abort, MRO ordering (`mro.index(AsyncScheduler) < mro.index(Scheduler)`), `finish_requests` cleanup.

Related Issues
Checklist
- Formatting checked (`bash format.sh`)
- Commits carry a `Signed-off-by:` line (DCO compliance)