Hi,
Thank you for sharing the paper and the accompanying code—it’s been an insightful read!
I had a few questions regarding the implementation of the KV cache in Q-Hitter, particularly its size and quantization:
Paper vs. Code Discrepancy:
The pseudocode in the paper mentions that the KV values are quantized and stored in the KV cache. However, while reviewing the code (specifically modify_llama.py), it appears that the KV values are retained in fp16 format. The 4-bit reduced version of the tensors seems to be used only within the q_score function and is discarded afterward.
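To make the distinction concrete, here is a minimal sketch of the two patterns I am contrasting. It is not the repository's code: `quantize_4bit`, `store_quantized`, `store_full_precision`, and the error-based score are all hypothetical stand-ins, and plain Python lists stand in for tensors.

```python
def quantize_4bit(values):
    """Uniform 4-bit quantization of a list of floats: returns (codes, scale, offset)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # 16 levels; fall back to 1.0 for constant input
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map 4-bit codes back to approximate float values."""
    return [c * scale + lo for c in codes]


def store_quantized(cache, values):
    # Pattern A (what I read in the paper's pseudocode): the cache keeps
    # the 4-bit codes plus the scale/offset needed to dequantize later.
    cache.append(quantize_4bit(values))


def store_full_precision(cache, values):
    # Pattern B (what the code appears to do): quantize only to compute a
    # score, discard the codes, and keep the original values in the cache.
    codes, scale, lo = quantize_4bit(values)
    approx = dequantize(codes, scale, lo)
    score = sum(abs(v - a) for v, a in zip(values, approx))  # hypothetical score
    cache.append(list(values))
    return score


kv = [0.12, -0.5, 0.33, 0.9]
cache_a, cache_b = [], []
store_quantized(cache_a, kv)
score = store_full_precision(cache_b, kv)
```

In pattern B the cache entry is identical to the input, so its memory footprint is unchanged by the quantization step.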
Throughput Improvement:
If the KV values are indeed stored in fp16 and not quantized in the cache, could you clarify where the throughput improvement originates? I’m particularly interested in understanding how the 4-bit quantization in the q_score function contributes to the overall performance gains.
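For reference, this is the back-of-the-envelope size difference I would expect if the cache were stored in 4 bits. The model shape (32 layers, 32 KV heads, head dimension 128) and the sequence length are assumptions chosen for illustration, and the estimate ignores the per-group scale and zero-point overhead that a real 4-bit format would add.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bits_per_value):
    """Approximate KV cache size in bytes, ignoring quantization metadata."""
    # Two tensors (K and V) are cached per layer.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bits_per_value // 8


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, 2048 tokens, batch size 1.
fp16_bytes = kv_cache_bytes(32, 32, 128, 2048, 1, bits_per_value=16)
int4_bytes = kv_cache_bytes(32, 32, 128, 2048, 1, bits_per_value=4)

print(f"fp16 cache:  {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit cache: {int4_bytes / 2**30:.2f} GiB")
```

Under these assumptions the fp16 cache is 4x larger than the 4-bit one, which is why I am unsure where the gain comes from if the cache stays in fp16.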
Thank you for your time. I look forward to your response.