feat: add prompt prefix caching to SimpleEngine #90
panbanda wants to merge 1 commit into waybarrios:main
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Really nice work on this. The 15-27x speedup numbers are impressive, and using mlx-lm's [...]
I noticed a few things that might need attention:
Issues 1 and 2 together might mean cache hits don't return the right KV state. Happy to help think through the snapshotting approach if that would be useful. The overall design direction is great, though; this will be a big win for interactive use cases.
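To make the snapshotting concern concrete, here is a minimal sketch of the idea, not the PR's actual code. It assumes only that generation mutates the cache object it is handed. `ToyKVCache`, `snapshot_for_generation`, and `toy_generate` are hypothetical stand-ins and do not correspond to mlx-lm classes or functions.

```python
import copy


class ToyKVCache:
    """Hypothetical stand-in for a KV cache entry (not an mlx-lm class)."""

    def __init__(self, tokens=None):
        self.tokens = list(tokens or [])

    def append(self, token):
        # Mimics the in-place mutation that happens during generation.
        self.tokens.append(token)


def snapshot_for_generation(cached):
    """Return a private deep copy of a cached entry, or None if copying fails."""
    try:
        return copy.deepcopy(cached)
    except Exception:
        # Prefer a cold start over sharing mutable state with the cache.
        return None


def toy_generate(cache, new_tokens):
    """Hypothetical generation step that mutates `cache` in place."""
    for token in new_tokens:
        cache.append(token)
    return cache


stored = ToyKVCache([1, 2, 3])             # entry owned by the prefix cache
working = snapshot_for_generation(stored)  # private copy for this request
if working is None:
    working = ToyKVCache()                 # fallback: start from an empty cache
toy_generate(working, [4, 5])
```

After the call, `stored` still holds only the original prefix while `working` holds the extended sequence, which is the property a cache hit needs in order to return the right KV state on the next request.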
Summary
- Add prompt prefix caching to `SimpleEngine.stream_generate()` using mlx-lm's `LRUPromptCache` (trie-based prefix matching with LRU eviction)
- Strip `x-anthropic-` tracking headers from system blocks in the Anthropic adapter, since they contain per-request hashes that break prefix matching
- Add a `prompt_cache_size` parameter (default 10, 0 to disable)

Implementation details
- `vllm_mlx/models/llm.py`: Accept an optional `prompt_cache` parameter in `stream_generate()` and pass it through to `mlx_lm.stream_generate()`. Also accept `list[int]` (token IDs) as the prompt type, not just `str`.
- `vllm_mlx/engine/simple.py`:
  - Look up a matching cached prefix via `LRUPromptCache.fetch_nearest_cache()`
  - Deep-copy the cached state before generation (`generate_step()` mutates the cache in place), with a try/except fallback if deepcopy fails on MLX arrays
  - Clear the cache in `stop()` to prevent stale KV state if the engine is recycled
  - `prompt_cache_size` is configurable (default 10, 0 to disable caching entirely)
- `vllm_mlx/api/anthropic_adapter.py`: Skip system content blocks starting with `x-anthropic-` (e.g. `x-anthropic-billing-header: cc_version=...; cch=HASH`). The `cch=` hash changes on every request, causing prefix divergence at token ~33 and defeating the entire cache.
- `tests/test_simple_engine_prefix_cache.py`: 7 new tests covering cache miss on first request, disabled cache (size=0), `stop()` cleanup, configurable size, billing header stripping, and edge cases.

Test results (Claude Code agentic loop)
Test plan
- `prompt_cache_size=0` disables caching cleanly
- `stop()` clears stale cache state

Generated with Claude Code
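For readers following the adapter change, the header stripping described for `vllm_mlx/api/anthropic_adapter.py` might look roughly like the sketch below. It assumes system content arrives as a list of `{"type": "text", "text": ...}` blocks. The function name `strip_tracking_blocks` and the sample header values are hypothetical and not taken from the PR.

```python
def strip_tracking_blocks(system_blocks, prefix="x-anthropic-"):
    """Drop system blocks whose text starts with `prefix`; keep the rest in order."""
    kept = []
    for block in system_blocks:
        # Accept both dict-style blocks and plain strings.
        text = block.get("text", "") if isinstance(block, dict) else str(block)
        if text.lstrip().startswith(prefix):
            continue  # per-request tracking header: would break prefix matching
        kept.append(block)
    return kept


def make_request_blocks(request_hash):
    """Build illustrative system blocks; the header values are made up."""
    return [
        {"type": "text",
         "text": f"x-anthropic-billing-header: cc_version=1; cch={request_hash}"},
        {"type": "text", "text": "You are a helpful assistant."},
    ]


first = strip_tracking_blocks(make_request_blocks("abc123"))
second = strip_tracking_blocks(make_request_blocks("def456"))
```

With the header removed, two requests that differ only in their `cch=` value produce identical system content, so their token prefixes no longer diverge early in the prompt.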