[Paged KV] Enable prefix caching on the unified paged path#283
Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>
WindChimeRan left a comment
Maybe we should put the Paged Prefix Cache tests into separate files.
This will turn on prefix caching by default for users. Could you please run a quick benchmark comparing with and without prefix caching? We can check TTFT, throughput, and cache hit rate to see whether it actually helps. For the dataset, concatenating a shared system prompt should be fine.
Benchmarked with `vllm bench serve --dataset-name prefix_repetition` on Qwen3-0.6B (paged attention). Workload: one shared 512-token prefix, 50 prompts each with a 128-token suffix and a 100-token output; 10 repeats per side, median ± pstdev reported.

| Metric | Cache off (med ± sd) | Cache on (med ± sd) | Δ |
|---|---|---|---|
| Throughput (tok/s) | 210.59 ± 8.93 | 288.88 ± 12.23 | 1.37× |
| TTFT mean (ms) | 8667.88 ± 588.32 | 6071.39 ± 554.66 | 1.43× |
| TTFT P50 (ms) | 8622.69 ± 645.68 | 5894.11 ± 451.23 | 1.46× |
| TTFT P99 (ms) | 16308.74 ± 1049.18 | 11523.16 ± 943.63 | 1.42× |
| TPOT P50 (ms) | 30.74 ± 1.63 | 22.88 ± 0.77 | 1.34× |
| TPOT P99 (ms) | 119.51 ± 51.95 | 25.62 ± 8.99 | 4.67× |
| E2EL P50 (ms) | 11574.40 ± 664.36 | 8141.27 ± 652.99 | 1.42× |
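The "median ± pstdev" aggregation over the 10 repeats can be sketched as follows (the sample values below are made up for illustration, not taken from the runs):

```python
# Aggregate one metric across benchmark repeats as median ± population stdev,
# matching the "med ± sd" columns in the table above.
from statistics import median, pstdev


def summarize(samples: list[float]) -> tuple[float, float]:
    """Return (median, population stdev) for one metric across repeats."""
    return median(samples), pstdev(samples)


# Hypothetical throughput samples (tok/s) from 10 cache-off repeats:
cache_off = [199.1, 205.3, 208.8, 210.0, 210.4, 210.8, 212.5, 215.0, 221.7, 228.9]
med, sd = summarize(cache_off)
```

`pstdev` (population stdev) is used rather than `stdev` because all 10 repeats are reported, not a sample of a larger population.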
Cache hit rate (from /metrics):
- Cold (run 1): 78.4% (25088 / 32000)
- Warm (runs 2-10, each): 97.5% (31200 / 32000)
- Cumulative across 10 cache-on runs: 95.6% (305888 / 320000)
Baseline (`baseline-*`) runs report 0 queries / 0 hits (cache disabled).
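The cumulative figure is consistent with the per-run numbers (one cold run plus nine warm runs); a quick arithmetic check:

```python
# Consistency check for the cumulative cache hit rate reported above:
# 1 cold run + 9 warm runs, 32000 prefix-cache queries per run.
cold_hits, warm_hits, queries_per_run = 25088, 31200, 32000

total_hits = cold_hits + 9 * warm_hits   # 25088 + 280800 = 305888
total_queries = 10 * queries_per_run     # 320000
hit_rate = total_hits / total_queries    # ~0.956, i.e. the 95.6% cumulative figure
```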
(On TPOT P99: cache-on is stable across 10 runs (std ±9 ms, range 23-50 ms), while the baseline is noisy (std ±52 ms, range 36-197 ms). The single-sweep ratio floats between ~3.5× and ~5.5× because of baseline noise, not cache-side flakiness; the distribution-based 4.67× is the robust number.)
The test has no cache-hit assertion. Suggest asserting on the scheduler's cache-hit counters or on block reuse in `runner._request_states`. Even a single `assert computed_tokens > 0` on at least one request would help.
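A minimal sketch of the suggested assertion, using a stand-in stats object (the field names here are hypothetical, not vllm-metal's actual scheduler API):

```python
from dataclasses import dataclass


# Stand-in for whatever cache-hit counters the scheduler exposes;
# `queries`/`hits` are assumed names, not the real attributes.
@dataclass
class PrefixCacheStats:
    queries: int
    hits: int


def assert_cache_was_hit(stats: PrefixCacheStats) -> None:
    # Even this minimal check would catch a silently disabled cache:
    assert stats.queries > 0, "prefix cache was never queried"
    assert stats.hits > 0, "no cached block was ever reused"


# With the warm-run numbers from the benchmark, this passes:
assert_cache_was_hit(PrefixCacheStats(queries=32000, hits=31200))
```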
Please add a follow-up patch for the docs in #284.
Will follow up once #284 lands.
Summary

Removes the platform-layer force-disable that was blocking `enable_prefix_caching` on the paged path. The unified prefill code already handles `num_computed_tokens > 0` (#195, #207, #208, #211); the only remaining gap was that `platform.py:278-285` overrode it.

Adds an end-to-end correctness test that fires identical prompts twice through `vllm.LLM(enable_prefix_caching=True)`; the second pass walks the `start_pos > 0` path, and the tokens still match the cache-off golden output.

Both test classes share a single `LLM` fixture: Metal memory held by a released `LLM` is not freed by Python gc, so a second module-scoped `LLM` would hit `kv_budget=0`.

Hybrid models

Upstream `ModelConfig.is_prefix_caching_supported` already returns False for hybrid/Mamba models, so the `default_prefix_caching` resolution in `vllm/engine/arg_utils.py` keeps the cache off unless the user explicitly forces it. No vllm-metal-side guard is needed.

Benchmark

See the benchmark results posted in the review thread above.
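For intuition, prefix caching amounts to content-addressing full KV blocks by the entire token prefix they cover, so two requests that share a prefix map their leading blocks to the same cache entries. A toy sketch of that idea (this only mirrors the concept; vllm's real block tables and hashing differ):

```python
# Toy model of prefix-cache block reuse. Each *full* block is keyed by the
# whole token prefix up to and including that block, so a key match implies
# the entire preceding context matches too.
BLOCK = 4  # toy block size, not vllm's actual default


def block_keys(tokens: list[int]) -> list[tuple[int, ...]]:
    """One key per full block; each key covers the whole prefix so far."""
    return [tuple(tokens[: i + BLOCK]) for i in range(0, len(tokens) - BLOCK + 1, BLOCK)]


def num_computed_tokens(cache: set, tokens: list[int]) -> int:
    """Tokens skippable at prefill = longest run of already-cached leading blocks."""
    n = 0
    for key in block_keys(tokens):
        if key not in cache:
            break
        n += BLOCK
    return n


cache: set = set()
first = list(range(12))                     # request 1: tokens 0..11
cache.update(block_keys(first))             # prefill populates the cache
second = list(range(8)) + [99, 98, 97, 96]  # request 2 shares an 8-token prefix
# num_computed_tokens(cache, second) reuses the two shared full blocks (8 tokens),
# which is exactly the num_computed_tokens > 0 case the unified prefill path handles.
```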
Closes #182.