Regarding Quantization Of The KV Cache #3

akshatsh49 · 2025-02-05T19:36:34Z

Hi,

Thank you for sharing the paper and the accompanying code—it’s been an insightful read!

I had a few questions regarding the implementation of the KV cache in Q-Hitter, particularly its size and quantization:

1. Paper vs. code discrepancy: The pseudocode in the paper states that the KV values are quantized and then stored in the KV cache. However, in the code (specifically modify_llama.py), the KV values appear to be retained in fp16. The 4-bit versions of the tensors seem to be used only within the q_score function and are discarded afterward.
2. Throughput improvement: If the KV values are indeed stored in fp16 rather than quantized in the cache, could you clarify where the throughput improvement comes from? In particular, how does the 4-bit quantization inside q_score contribute to the overall performance gains?
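To make the concern concrete, here is a rough back-of-the-envelope sketch. It is not taken from the repository: `kv_cache_bytes` is a hypothetical helper, the shapes are assumed LLaMA-7B-like values (32 layers, 32 heads, head dimension 128), and the sequence length and batch size are arbitrary. It also ignores the scale and zero-point overhead that a real 4-bit format would add.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch_size, bits):
    """Approximate KV cache size in bytes (hypothetical helper, illustration only).

    Counts one key and one value tensor per layer and ignores any
    quantization metadata such as scales or zero points.
    """
    elements = 2 * num_layers * num_heads * head_dim * seq_len * batch_size
    return elements * bits // 8


# Assumed LLaMA-7B-like shapes; seq_len and batch_size are arbitrary choices.
shape = dict(num_layers=32, num_heads=32, head_dim=128, seq_len=2048, batch_size=1)

fp16_bytes = kv_cache_bytes(bits=16, **shape)
int4_bytes = kv_cache_bytes(bits=4, **shape)

print(f"fp16 cache:  {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit cache: {int4_bytes / 2**30:.2f} GiB")
```

Under these assumptions the cache shrinks by 4x only if the 4-bit tensors are what actually gets stored. If the cache stays in fp16 and the 4-bit tensors are used only inside q_score, the footprint is unchanged, which is why I am asking where the gain comes from.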

Thank you for your time. I look forward to your response.
