What happens when KV cache size is not enough to hold all K, V? #4070
shensimeteor asked this question in Q&A (unanswered, 0 replies)
I'm new to LLM inference, so this is probably a dumb question; I just want to understand how the KV cache works under the hood.
From the SGLang logs, it seems the server first loads the model weights, then uses the remaining GPU memory together with mem_fraction_static to decide how large the KV cache should be.
When requests come in and the KV cache turns out to be too small to hold all of their K and V tensors, what will SGLang do?
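To make sure I'm reading the logs right, here is a rough sketch of how I imagine the sizing step works. This is my own back-of-the-envelope calculation, not SGLang's actual code; the function name and parameters are made up for illustration, and I'm assuming fp16/bf16 KV entries:

```python
# Hypothetical sketch of KV cache sizing (NOT SGLang's real implementation):
# after weights load, the memory budget implied by mem_fraction_static,
# minus the weight footprint, is divided into per-token KV slots.

def kv_cache_capacity_tokens(
    total_gpu_bytes: int,
    weight_bytes: int,
    mem_fraction_static: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # fp16/bf16 = 2 bytes per element
) -> int:
    # Bytes to store both K and V for one token across all layers.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    # Budget left for the cache after the weights are resident.
    budget = total_gpu_bytes * mem_fraction_static - weight_bytes
    return max(0, int(budget // per_token))

# Example: 80 GB GPU, ~14 GB of fp16 weights (roughly a 7B model),
# 32 layers with 8 KV heads of dim 128 -> about 475k token slots.
print(kv_cache_capacity_tokens(80 * 1024**3, 14 * 1024**3,
                               0.9, 32, 8, 128))
```

If something like this is right, then with my ~100k-token inputs even a handful of concurrent requests would exhaust the cache, which is why I'm asking about the overflow behavior.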
Thanks in advance!
Again, I'm new here, so I may have gotten some details wrong; please correct me if so.
I'm also trying to tune the performance of LLM generation (input context length ~100k tokens, output context length ~1k tokens). Any tuning suggestions are appreciated!