What happens when KV cache size is not enough to hold all K, V? #4070
shensimeteor asked this question in Q&A (unanswered, 0 replies)
I'm new to LLM inference, so this is probably a dumb question; I just want to understand how the KV cache works under the hood.
From the SGLang logs, it seems the server first loads the model weights, then uses the remaining GPU memory together with mem_fraction_static to decide how large the KV cache should be.
When requests come in and the KV cache turns out to be too small to hold all of their K and V tensors, what will SGLang do?
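To make sure I'm reading the logs right, here is a rough sketch of how I imagine the sizing step works. This is my own back-of-the-envelope calculation, not SGLang's actual code; the function name and parameters are made up for illustration, and I'm assuming fp16/bf16 KV entries:

```python
# Hypothetical sketch of KV cache sizing (NOT SGLang's real implementation):
# after weights load, the memory budget implied by mem_fraction_static,
# minus the weight footprint, is divided into per-token KV slots.

def kv_cache_capacity_tokens(
    total_gpu_bytes: int,
    weight_bytes: int,
    mem_fraction_static: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # fp16/bf16 = 2 bytes per element
) -> int:
    # Bytes to store both K and V for one token across all layers.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    # Budget left for the cache after the weights are resident.
    budget = total_gpu_bytes * mem_fraction_static - weight_bytes
    return max(0, int(budget // per_token))

# Example: 80 GB GPU, ~14 GB of fp16 weights (roughly a 7B model),
# 32 layers with 8 KV heads of dim 128 -> about 475k token slots.
print(kv_cache_capacity_tokens(80 * 1024**3, 14 * 1024**3,
                               0.9, 32, 8, 128))
```

If something like this is right, then with my ~100k-token inputs even a handful of concurrent requests would exhaust the cache, which is why I'm asking about the overflow behavior.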
Thanks in advance!
Again, I'm new here, so I may have gotten some details wrong; please correct me if so.
I'm also trying to tune the performance of LLM generation (input context length ~100k tokens, output context length ~1k tokens). Any tuning suggestions are appreciated!