Error when using kv_cache_dtype = fp8_e5m2 #4106
Answered by Fridge003
shensimeteor asked this question in Q&A
Hi everyone, I'm using tag 0.4.3 and trying to use KV cache quantization to speed up my batch inference. However, I got an error after adding kv_cache_dtype = fp8_e5m2.
The model architecture is similar to Llama 2 (GQA).
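
Roughly, the setup looks like this (a minimal sketch, assuming SGLang's offline Engine API since the thread mentions FlashInfer, SGLang's default attention backend; the model path below is a placeholder, not the actual model):

```python
# Minimal sketch, assuming SGLang's offline Engine API; the model path is a
# placeholder, not the GQA model from the original report.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-2-7b-hf",  # placeholder model path
    kv_cache_dtype="fp8_e5m2",              # the setting that triggered the error on 0.4.3
)

outputs = llm.generate(
    ["Hello, my name is"],
    {"max_new_tokens": 32},
)
print(outputs)
llm.shutdown()
```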
Answered by Fridge003 on Mar 6, 2025

Replies: 1 comment · 1 reply
I got help on the Slack channel: the reason is that FlashInfer doesn't support KV cache quantization yet, so I changed the setting as a workaround.
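
For reference, a sketch of the kind of change this implies (the exact setting switched to isn't shown above, so both options here are assumptions): either turn KV cache quantization back off, or keep fp8_e5m2 and use the Triton attention backend instead of FlashInfer.

```python
# Sketch of two possible workarounds; the exact change made in the thread is not
# shown, so treat these as assumptions rather than the reported fix.
import sglang as sgl

# Option 1: drop KV cache quantization until the FlashInfer path supports it.
llm = sgl.Engine(
    model_path="meta-llama/Llama-2-7b-hf",  # placeholder model path
    kv_cache_dtype="auto",
)

# Option 2: keep fp8_e5m2 but avoid the FlashInfer path via the Triton backend.
# llm = sgl.Engine(
#     model_path="meta-llama/Llama-2-7b-hf",
#     kv_cache_dtype="fp8_e5m2",
#     attention_backend="triton",
# )
```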
1 reply
Hi, #4147 has fixed the flashinfer bug. Please pull the latest main branch and try again.
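
A minimal sketch of the retry after updating (the model path, port, and endpoints below are defaults and assumptions, not taken from this thread): launch the server with the same kv_cache_dtype flag and send a test request.

```python
# Sketch of a retry after pulling the latest main: relaunch the server with the
# same fp8_e5m2 setting, wait for it to come up, then send a test request.
import subprocess
import time
import requests

server = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-2-7b-hf",  # placeholder model path
    "--kv-cache-dtype", "fp8_e5m2",
    "--port", "30000",
])

# Poll the health endpoint until the server is ready (assumes the default /health route).
for _ in range(120):
    try:
        if requests.get("http://127.0.0.1:30000/health").status_code == 200:
            break
    except requests.ConnectionError:
        time.sleep(5)

resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 32}},
)
print(resp.json())
server.terminate()
```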