
feat(inference): layer-aligned KV cache management for sequential layer processing #1964

@mrveiss

Description

Summary

Implement a KV cache management system aligned to layer indices, enabling KV cache reuse across autoregressive generation steps even when only one layer's weights are in GPU memory at a time.

Background

During layer-by-layer inference, model weights are loaded/evicted per layer, but KV caches must persist across the entire forward pass AND across generation steps. KV caches are relatively small compared to model weights (proportional to seq_len × num_heads × head_dim, not model size).

airllm's approach:

```python
# Initialize: one (k_cache, v_cache) pair per transformer layer
kv_cache_list = [([], []) for _ in self.layers]

# During generation, each layer reads/writes its own cache
k_cache, v_cache = past_key_values[layer_idx]
# The layer computes attention over cached keys/values + the new token;
# the updated cache is returned for the next generation step

# Trim non-transformer entries (embed at the front; norm, lm_head at the back)
kv_cache_list = kv_cache_list[1:-2]  # only transformer layers have KV caches
```
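The per-layer cache update above can be sketched as follows. This is a minimal illustration with NumPy arrays standing in for tensors; the helper name `append_kv` is hypothetical, not airllm's actual API:

```python
import numpy as np

def append_kv(cache, k_new, v_new):
    """Append the new token's K/V to one layer's (k, v) cache pair.

    cache: tuple (k, v) of arrays shaped [seq, kv_heads, head_dim],
           or ([], []) for the empty initial cache (as in airllm's init).
    k_new, v_new: arrays shaped [1, kv_heads, head_dim] for the new token.
    """
    k, v = cache
    if isinstance(k, list):  # empty initial cache
        return k_new, v_new
    return np.concatenate([k, k_new]), np.concatenate([v, v_new])

# One (k, v) pair per transformer layer
num_layers, kv_heads, head_dim = 4, 8, 128
kv_cache_list = [([], []) for _ in range(num_layers)]

# One generation step: each layer updates only its own slot, so the cache
# survives even though only one layer's weights are resident at a time
for layer_idx in range(num_layers):
    k_new = np.zeros((1, kv_heads, head_dim), dtype=np.float16)
    v_new = np.zeros((1, kv_heads, head_dim), dtype=np.float16)
    kv_cache_list[layer_idx] = append_kv(kv_cache_list[layer_idx], k_new, v_new)
```

Because each slot is indexed by `layer_idx`, loading and evicting layer weights never touches the cache list, which is what lets the cache persist across the whole forward pass and across generation steps.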

On the RTX 4070 with 8GB VRAM, KV cache size is the main constraint on sequence length for large models. Explicit memory budgeting is essential.

Memory Budget Example (70B model, 8GB VRAM)

  • Per-layer KV cache at seq_len=2048: ~8MB (8 KV heads × 128 dim × 2048 seq × 2 (K+V) × 2 bytes fp16)
  • 80 layers total: ~640MB for the full KV cache
  • Remaining VRAM for weights: ~7.4GB (enough for one layer at a time)
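The budget calculator in the acceptance criteria could look roughly like this. Function names and the per-layer weight figure are illustrative assumptions, not the final design:

```python
def kv_bytes_per_layer(seq_len, kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes for one layer: K and V each hold
    seq_len x kv_heads x head_dim elements."""
    return 2 * seq_len * kv_heads * head_dim * dtype_bytes

def max_seq_len(vram_bytes, weight_bytes_per_layer, num_layers,
                kv_heads, head_dim, dtype_bytes=2):
    """Largest seq_len whose full (all-layer) KV cache fits alongside
    one layer's weights in VRAM. Ignores activation/workspace overhead."""
    budget = vram_bytes - weight_bytes_per_layer
    per_token = num_layers * 2 * kv_heads * head_dim * dtype_bytes
    return max(budget // per_token, 0)

# 70B-class model on an 8GB card; ~1.7GB per fp16 layer is an assumed figure
print(kv_bytes_per_layer(2048, 8, 128) / 2**20)   # 8.0 (MiB per layer)
print(max_seq_len(8 * 2**30, int(1.7 * 2**30), 80, 8, 128))
```

In practice the calculator would also need to reserve headroom for activations and fragmentation, so the real limit sits below this upper bound.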

Key Files

  • autobot-backend/llm_interface_pkg/optimization/kv_cache.py — new
  • autobot-backend/llm_interface_pkg/providers/ — integrate into generation loop
  • autobot-backend/llm_interface_pkg/optimization/memory_manager.py — VRAM budgeting

Acceptance Criteria

  • Layer-aligned KV cache with per-layer (k, v) pairs
  • KV cache reuse across autoregressive generation steps
  • VRAM budget calculator: max sequence length given model + VRAM
  • Cache stored on GPU (small enough to coexist with one layer's weights)
  • Trim non-transformer entries from cache list
  • Support for GQA models (fewer KV heads than Q heads)
  • Cache eviction strategy when approaching VRAM limit
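For the GQA criterion, the usual approach is to cache only the KV heads and repeat them per query-head group at attention time, which is what keeps the cache small. A sketch, assuming the standard repeat-interleave expansion (helper name is hypothetical):

```python
import numpy as np

def expand_kv_for_gqa(k, num_q_heads):
    """Repeat each cached KV head across its query-head group.

    k: [seq, kv_heads, head_dim] -> [seq, num_q_heads, head_dim].
    The cache stays at kv_heads wide; expansion happens only at attention time.
    """
    seq, kv_heads, head_dim = k.shape
    assert num_q_heads % kv_heads == 0, "Q heads must be a multiple of KV heads"
    group = num_q_heads // kv_heads
    return np.repeat(k, group, axis=1)

k = np.ones((2048, 8, 128), dtype=np.float16)  # 8 KV heads cached (GQA)
k_full = expand_kv_for_gqa(k, 64)              # expanded to 64 Q heads
print(k_full.shape)  # (2048, 64, 128)
```

Caching at 8 heads instead of 64 is exactly the 8x saving baked into the budget numbers above.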

Related Issues
