Add TurboQuant KV cache compression for MHA #261
WindChimeRan merged 1 commit into vllm-project:main from
Conversation
Force-pushed from 73a42af to fe52c01
WindChimeRan
left a comment
- Deep coupling with pagedattention.metal. Suggest moving TQ to a separate kernel file.
- This PR introduces an env var. Is it possible to use --kv-cache-dtype instead?
- turboquant.py uses import logging directly instead of from vllm.logger import init_logger like the rest of the codebase (see the sketch below).
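For reference, a minimal sketch of the repo-wide logging convention the last point refers to (the log message itself is illustrative):

```python
# turboquant.py -- use the project logger instead of the stdlib logging module
from vllm.logger import init_logger

logger = init_logger(__name__)

logger.info("TurboQuant KV cache compression enabled")
```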
|
Thanks for the review - will fix |
Force-pushed from 497bf6c to dcd5cf1
|
Fixed the concerns @WindChimeRan - moved TQ logic to turboquant.metal; added --turboquant-k-quant and --turboquant-v-quant (V is uint3 only so far) args; fixed the logger import and lint errors |
Force-pushed from d62068a to 9d148e0
|
- CLI arg name mismatch (--turboquant-k-quant vs --turboquant-k-dtype)
- wrong env var in the serve benchmark (VLLM_METAL_K_QUANT → VLLM_METAL_TURBOQUANT_K_DTYPE)
- help text typo (uint3 → q3_0)
|
Thanks @ericcurtin - fixed all the typos |
WindChimeRan
left a comment
The env var should be for the platform, not for the quantization method.
Please consider the following approach to remove the TQ env var:
Replace MetalConfig.from_env() with a from_vllm_config(vllm_config) classmethod that reads vllm_config.additional_config:
# In MetalPlatform.check_and_update_config, after existing logic:
add = vllm_config.additional_config or {}
if add.get("turboquant"):
    cfg = get_config()
    cfg.turboquant = True
    cfg.k_quant = add.get("k_quant", "q8_0")
    cfg.v_quant = add.get("v_quant", "q3_0")
    cfg._validate_turboquant()  # reuse the existing validators
Invocation:
vllm serve Llama-3.2-1B \
--additional-config '{"turboquant": true, "k_quant": "q4_0", "v_quant": "q3_0"}'
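For concreteness, a rough sketch of what the suggested classmethod could look like; the turboquant/k_quant/v_quant fields, defaults, and _validate_turboquant come from the snippet above, while the rest of the class body is an illustrative stand-in:

```python
# Sketch only: replacement for MetalConfig.from_env().
class MetalConfig:
    turboquant: bool = False
    k_quant: str = "q8_0"
    v_quant: str = "q3_0"

    def _validate_turboquant(self) -> None:
        ...  # existing validators

    @classmethod
    def from_vllm_config(cls, vllm_config) -> "MetalConfig":
        cfg = cls()
        add = vllm_config.additional_config or {}
        if add.get("turboquant"):
            cfg.turboquant = True
            cfg.k_quant = add.get("k_quant", "q8_0")
            cfg.v_quant = add.get("v_quant", "q3_0")
            cfg._validate_turboquant()  # reuse the existing validators
        return cfg
```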
|
@WindChimeRan new arg method added - uses additional config JSON to activate turboquant and its params |
|
@UndercoverMathGuy test failed. Could you please investigate why? |
|
Test fixes - added a getattr guard for the case where the argument is absent |
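A minimal sketch of the kind of guard described, assuming the additional-config plumbing from the earlier review comment (the variable names beyond that are illustrative):

```python
# Guard against configs that don't define additional_config at all
add = getattr(vllm_config, "additional_config", None) or {}
turboquant_enabled = bool(add.get("turboquant", False))
```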
|
Modified worker.py for the new cache API and found a bug in the GPU KV cache allocation, which was still assuming fp16 for the size calculation (fixed) |
|
@WindChimeRan tests should pass now - they passed locally for me |
|
@UndercoverMathGuy You need to sign your commits to pass the DCO check |
|
My DCO errors seem to be clear now - @ricky-chaoju |
The DCO check is still failing; you'll need to fix it on your side. Details: please refer to the contributing guide:
|
@ricky-chaoju - I've signed every single one of my own commits, but DCO is erroring because of a weird unexpected-email error:
|
|
@UndercoverMathGuy The DCO failures are from merge commits rather than your own commits. Could you try rebasing onto main instead of merging?
git fetch upstream
git rebase upstream/main
git push --force-with-lease origin main
That should leave only your own (correctly signed) commits in the PR and clear DCO. |
|
Yep - @ricky-chaoju - thanks so much for the help - DCO green now |
|
Critical: the YOCO shared_kv is not None guard was removed from the non-turboquant path (silent cache corruption for Gemma-3/YOCO models). Also:
- hardcoded RNG-key/Metal-sign table coupling
- duplicated TQ byte calculation
- the head_size_v integer division may be wrong
|
@ericcurtin - suggestions addressed, thanks - YOCO guard added; kept the hardcoded Metal sign table and added an RNG validation test (haven't made it dynamic); made the TQ byte calculation a function (see the sketch below); made a subclass of FullAttentionSpec for TurboQuant that uses the custom byte calculation for KV cache allocation |
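A hedged sketch of what such a byte-calculation helper might look like; the quant-type names come from this PR, but the per-block layouts below are assumptions (GGML-style values), not the PR's actual constants:

```python
# Illustrative only: bytes needed per token per KV head for a given quant type.
_BYTES_PER_BLOCK = {
    "q8_0": (32, 34),  # 32 values -> 32 int8 bytes + fp16 scale
    "q4_0": (32, 18),  # 32 values -> 16 nibble bytes + fp16 scale
}

def tq_bytes_per_token(head_size: int, quant: str) -> int:
    block, nbytes = _BYTES_PER_BLOCK[quant]
    assert head_size % block == 0, "head_size must be a multiple of the block size"
    return (head_size // block) * nbytes

# e.g. head_size=128 with q4_0 -> 4 blocks * 18 bytes = 72 bytes (vs 256 bytes for bf16)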
|
Non-blocking: the YOCO shared_kv guard sits behind the if kv_cache.turboquant branch, so it's unreachable when TQ is active. Moving the shared_kv is not None check before the TQ branch preserves the guard for all paths. |
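For illustration, a sketch of the suggested ordering only; the function signature and the body of each branch are stand-ins for whatever the real cache-write path does:

```python
def write_kv(kv_cache, key, value, shared_kv=None):
    # YOCO guard first, on the common path, so it holds for TQ and non-TQ alike
    if shared_kv is not None:
        key, value = shared_kv  # placeholder for whatever the existing guard does

    if getattr(kv_cache, "turboquant", False):
        ...  # TurboQuant encode + paged write
    else:
        ...  # regular fp16/bf16 paged write
```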
|
@ricky-chaoju - fixed - thanks |
WindChimeRan
left a comment
Lines 68 to 69 in e115dda
Please add docs under the feature tab
|
Non-blocking: this PR puts the quant decision inside the attention forward pass rather than at the backend-selection boundary. We'll need a refactor of attention backend dispatching - TQ should be a separate backend in a separate file. But that refactor will introduce a lot of changes, so it will be done in a future PR.
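A hedged sketch of the backend-boundary dispatch being described; the module paths and class names here are hypothetical, not this repo's actual layout:

```python
# Hypothetical: pick the backend once at dispatch time instead of branching
# inside every forward() call.
def select_attention_backend(metal_config):
    if getattr(metal_config, "turboquant", False):
        from vllm_metal.attention.backends.turboquant import TurboQuantBackend  # hypothetical
        return TurboQuantBackend
    from vllm_metal.attention.backends.paged import PagedAttentionBackend  # hypothetical
    return PagedAttentionBackend
```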
WindChimeRan
left a comment
Non-blocking: upstream vllm described an issue with TQ: vllm-project/vllm#38280
I'm mostly interested in:
- Hybrid models (Qwen3.5) not supported --> Do we fail loudly for this?
- Minimum 4-bit quantization --> "3-bit and 2-bit quantization produce garbage output." Is this true in our case?
These are non-blocking for this PR. But if we indeed have the same issues, we need to document them somewhere (2 --> a new issue, 3 --> documentation).
Signed-off-by: Ruhaan Rajadhyaksha <ruhaanr@gmail.com>
|
@WindChimeRan - fixed the head size bug in attention; added documentation for TQ (turboquant.md, added to mkdocs.yaml). The minimum K quant of 4-bit holds here too - any quantization below 4 bits produces nonsense from the model, either infinite loops or unrelated ramblings - added this to the documentation. This implementation of TurboQuant does support hybrid models (Qwen 3.5 - tested and verified): the current hybrid-model attention framework is just a wrapper over the uncorrelated GDN and standard SDPA paths (a patch function checks the layer list and routes each layer to the correct attention type), so we can activate TQ on the SDPA side without any allocation or other incompatibilities with GDN, as the two are completely separate. |
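A rough sketch of the layer routing described above, assuming a per-layer list distinguishing full-attention from GDN layers (all names here are illustrative, not the actual patch function):

```python
# Illustrative hybrid-model routing: full-attention layers use SDPA (where
# TurboQuant applies); linear-attention layers keep the GDN path untouched.
def patch_hybrid_attention(model, full_attention_layers: set[int]) -> None:
    for idx, layer in enumerate(model.layers):
        if idx in full_attention_layers:
            layer.attn_impl = "sdpa"   # TQ-compressed paged KV cache
        else:
            layer.attn_impl = "gdn"    # gated delta net, no KV cache to quantize
```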
|
Thanks @WindChimeRan - would you please close issue #217? |
What
Adds an opt-in quantized KV cache for MHA paged attention, controlled by an env var and two CLI flags:
VLLM_METAL_TURBOQUANT=1    # enables TQ
--turboquant-k-quant       # key quant type (q8_0 / q5_0 / q4_0 / uint2 / int8 / uint8 / etc.)
--turboquant-v-quant       # value quant type (q8_0 / q5_0 / q4_0 / q3_0 / q2_0)
Why
Unlocks essentially free context-length increases through smarter KV cache compression on memory-constrained Apple Silicon (e.g. an 8 GB M1 MacBook Air).
3.7x theoretical compression of KV cache (q4_0 quantization vs. bf16)
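For intuition on where a ratio in that range comes from, a back-of-envelope calculation; the block sizes below are assumptions, not this PR's exact layout:

```python
# Rough compression ratio vs bf16 for block-quantized formats (illustrative).
def ratio_vs_bf16(bits_per_value: float, block_size: int, scale_bits: int = 16) -> float:
    effective_bits = bits_per_value + scale_bits / block_size
    return 16.0 / effective_bits

print(round(ratio_vs_bf16(4, 64), 2))  # ~3.76x: 4-bit values, one fp16 scale per 64 values
print(round(ratio_vs_bf16(8, 32), 2))  # ~1.88x: 8-bit values, one fp16 scale per 32 values
```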
Quantization quality
Measured on random bf16 tensors (head_dim=128):
Overhead
Single-token encode + cache write on Qwen3-0.6B shape (4 KV heads, hd=128):
Future work