feat(inference): layer-aligned KV cache management for sequential layer processing #1964
Summary
Implement a KV cache management system aligned to layer indices, enabling KV cache reuse across autoregressive generation steps even when only one layer's weights are in GPU memory at a time.
Background
During layer-by-layer inference, model weights are loaded and evicted per layer, but KV caches must persist across the entire forward pass AND across generation steps. KV caches are small relative to model weights: per layer they grow with seq_len × num_kv_heads × head_dim, not with parameter count.
airllm's approach:

```python
# Initialize: one (k_cache, v_cache) pair per module in the layer list
kv_cache_list = [([], []) for _ in self.layers]

# During generation, each layer reads and writes its own cache
k_cache, v_cache = past_key_values[layer_idx]
# The layer computes attention over the cached keys/values plus the new token;
# the updated cache is returned for the next generation step.

# Trim non-transformer entries (embed, norm, lm_head)
kv_cache_list = kv_cache_list[1:-2]  # only transformer layers carry KV caches
```

On the RTX 4070 with 8GB VRAM, KV cache size is the main constraint on sequence length for large models, so explicit memory budgeting is essential.
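A minimal sketch of the layer-aligned structure this issue asks for, assuming PyTorch tensors shaped [batch, num_kv_heads, seq, head_dim]; the class name LayerAlignedKVCache and its methods are illustrative, not an existing API:

```python
import torch

class LayerAlignedKVCache:
    """One (k, v) pair per transformer layer, indexed by layer_idx."""

    def __init__(self, num_layers: int):
        # None until the layer caches its first tokens
        self.cache = [None] * num_layers

    def update(self, layer_idx, k_new, v_new):
        """Append this step's keys/values along the sequence dimension
        and return the full (k, v) tensors for attention."""
        entry = self.cache[layer_idx]
        if entry is None:
            self.cache[layer_idx] = (k_new, v_new)
        else:
            k_old, v_old = entry
            self.cache[layer_idx] = (
                torch.cat([k_old, k_new], dim=2),
                torch.cat([v_old, v_new], dim=2),
            )
        return self.cache[layer_idx]
```

Because the cache lives on the GPU independently of which layer's weights are currently resident, each layer can call update(layer_idx, k, v) right after its weights arrive, and the result survives the layer's eviction.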
Memory Budget Example (70B model, 8GB VRAM)
- Per-layer KV cache at seq_len=2048: ~8MB (8 KV heads × 128 head_dim × 2048 tokens × 2 (K+V) × 2 bytes fp16)
- 80 layers total: ~640MB for the full KV cache
- Remaining VRAM for weights: ~7GB (comfortably enough for one layer at a time; a 70B fp16 layer is roughly 1.75GB)
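These numbers fall out of a one-line formula, so the budget calculator called for below can be a small function. A sketch under the assumptions above (8 KV heads, head_dim 128, 80 layers, fp16); the function name max_kv_seq_len is hypothetical:

```python
def max_kv_seq_len(vram_budget_bytes: int,
                   num_layers: int = 80,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Largest sequence length whose full KV cache fits in the budget."""
    # Per token, every layer stores K and V, each num_kv_heads * head_dim wide.
    bytes_per_token = num_layers * num_kv_heads * head_dim * 2 * dtype_bytes
    return vram_budget_bytes // bytes_per_token

# Reserving 2 GiB of the 8GB card for KV cache:
print(max_kv_seq_len(2 * 1024**3))  # -> 6553 tokens
```

Note that the formula uses num_kv_heads, not the query head count; that is exactly how GQA models (e.g. 8 KV heads vs. 64 query heads on Llama-style 70B models) keep the cache small enough to fit on an 8GB card.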
Key Files
- autobot-backend/llm_interface_pkg/optimization/kv_cache.py — new
- autobot-backend/llm_interface_pkg/providers/ — integrate into generation loop
- autobot-backend/llm_interface_pkg/optimization/memory_manager.py — VRAM budgeting
Acceptance Criteria
- Layer-aligned KV cache with per-layer (k, v) pairs
- KV cache reuse across autoregressive generation steps
- VRAM budget calculator: max sequence length given model + VRAM
- Cache stored on GPU (small enough to coexist with one layer's weights)
- Trim non-transformer entries from cache list
- Support for GQA models (fewer KV heads than Q heads)
- Cache eviction strategy when approaching the VRAM limit (see the sliding-window sketch after this list)
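For the eviction criterion, one simple strategy is a sliding window that drops the oldest cached tokens in every layer once the cache nears its budget. This is an assumption about how the criterion could be met, not a mandated design; windowed eviction changes attention results, since evicted tokens are forgotten:

```python
import torch

def evict_to_window(cache, window: int):
    """Keep only the most recent `window` tokens in each layer's (k, v) pair.

    cache: list of (k, v) tensors shaped [batch, num_kv_heads, seq, head_dim],
    with None for layers that have not cached anything yet.
    """
    for i, entry in enumerate(cache):
        if entry is None:
            continue
        k, v = entry
        if k.size(2) > window:
            # Slicing yields a view into the old storage; .contiguous() copies
            # into a fresh tensor so the larger backing buffer can be freed.
            cache[i] = (k[:, :, -window:].contiguous(),
                        v[:, :, -window:].contiguous())
    # Release now-unused cached blocks held by PyTorch's allocator.
    torch.cuda.empty_cache()
```

This would pair naturally with the centralized cleanup utility from #1942 between generation steps.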
Related Issues
- feat(optimization): layer-by-layer inference mode for batch/offline processing #1946 (layer-by-layer inference — KV cache is essential for generation)
- feat(optimization): centralized GPU memory cleanup utility (airllm pattern) #1942 (memory cleanup — clean between generation steps if needed)