feat: First pass at llama_kv_cache_hybrid #13276
Conversation
This implementation covers both `llama_memory_i` and `llama_kv_cache` interfaces, but they could very well not be correct.

Branch: HybridCache

Signed-off-by: Gabe Goodhart <[email protected]>
Awesome to see this progress!
```cpp
// TODO: Will it cause problems if some caches are able to remove the seq
// but others aren't?
```
Yes, it will cause problems if this breaks the coherency between caches (e.g. part of a sequence is removed in one cache but not the other).

This is what I was referring to in #12799 (comment) when I wrote:

> The hardest part will be handling errors and properly keeping coherency between the different types of caches (because they don't necessarily roll back states in the same way).

I think the `seq_rm` API might fundamentally be too specific to the self-attention KV cache. Recurrent models can't roll back their state, because intermediate states are not kept; keeping them for all tokens would take too much space. (When `seq_rm` returns false, it means the states have to be re-calculated from scratch for the affected sequence; at least that was the intention in #5328.)

Ideally, if there were some API to create snapshots and roll back to them, the implementation would be simpler for recurrent models (and for hybrid models by extension). Technically, sequences (with `seq_id`) already kind of do this (and are copy-on-write), but snapshots within sequences might be more convenient to manage in user code, since tracking which state is the latest per sequence could be done transparently.

But that would also mean having to manage the lifetime of explicit state snapshots (in `examples/server/server.cpp`, among others) instead of directly dealing with ranges of token positions (and might make things like largest-common-prefix context caching harder to handle). I've previously shared some ideas about state snapshots/checkpoints in #7531 (comment) (although the first half of that comment is about session restore, as in `state_read`).
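To make the snapshot idea concrete, here is a minimal sketch of what such an interface could look like. Every name here (`llama_state_snapshot_i`, `llama_memory_snapshots_i`, `snapshot_create`, `snapshot_restore`) is hypothetical; nothing like this exists in llama.cpp today, and the exact ownership/lifetime rules are exactly the open question:

```cpp
#include <cstdint>
#include <memory>

typedef int32_t llama_seq_id; // as defined in llama.h

// hypothetical snapshot handle; opaque to the caller
struct llama_state_snapshot_i {
    virtual ~llama_state_snapshot_i() = default;
};

// hypothetical extension of the memory interface -- a sketch, not a proposal
// of concrete llama.cpp API
struct llama_memory_snapshots_i {
    virtual ~llama_memory_snapshots_i() = default;

    // capture the current state of one sequence; the caller owns the snapshot
    virtual std::unique_ptr<llama_state_snapshot_i> snapshot_create(llama_seq_id seq_id) = 0;

    // restore a sequence to a previously captured state; returns false if the
    // snapshot is no longer valid (e.g. its backing storage was reused)
    virtual bool snapshot_restore(const llama_state_snapshot_i & snap) = 0;
};
```

A hybrid cache could then keep its children coherent by snapshotting every child before a destructive operation such as `seq_rm` and restoring all of them if any single child fails.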
```cpp
// If any of the caches are recurrent, require simple split
return llama_sbatch(batch, m_hparams.n_embd, m_has_recurrent, logits_all);
```
Simple split should not be used with recurrent models; they expect equal split.

See #7531 (comment), which illustrates the splits.
Suggested change:

```diff
-// If any of the caches are recurrent, require simple split
-return llama_sbatch(batch, m_hparams.n_embd, m_has_recurrent, logits_all);
+// If any of the caches are recurrent, require non-simple split
+return llama_sbatch(batch, m_hparams.n_embd, !m_has_recurrent, logits_all);
```
```cpp
if (m_has_recurrent) {
    return sbatch.split_simple(n_ubatch);
}
```
This will not work; recurrent models expect `split_equal` to be used.
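A minimal sketch of the fix, reusing the member names from the quoted snippet and the existing `llama_sbatch::split_equal` method:

```cpp
if (m_has_recurrent) {
    // recurrent caches expect ubatches with an equal number of tokens per
    // sequence, so use split_equal here rather than split_simple
    return sbatch.split_equal(n_ubatch);
}
```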
```cpp
// TODO: Is this correct?
// If any children can shift, return true
for (const auto & cache : m_children) {
    if (cache->get_can_shift()) {
        return true;
    }
}
```
Maybe this should be: if all children can shift, then return true.

But as you've noticed elsewhere, `can_shift` should technically always be true for all currently-implemented cache types, so I don't know if that part of the API will stay anyway.
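For illustration, the "all children" variant could look like this (a sketch reusing `m_children` and `get_can_shift` from the quoted snippet):

```cpp
#include <algorithm>

// the hybrid cache can shift only if every child cache can shift
return std::all_of(m_children.begin(), m_children.end(),
                   [](const auto & cache) { return cache->get_can_shift(); });
```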
Description

This implementation covers both `llama_memory_i` and `llama_kv_cache` interfaces, but they could very well not be correct.

Discussion
I'm putting this up for discussion even though it doesn't have much value standalone. My ultimate goal is support for the just-released Granite 4, which is a combination of `mamba2` and `granitemoeshared` layers. I opened #13275 to track the full scope of model architecture changes.