feat(inference): layer-aligned KV cache management for sequential layer processing #1964
Summary
Implement a KV cache management system aligned to layer indices, enabling KV cache reuse across autoregressive generation steps even when only one layer's weights are in GPU memory at a time.
Background
During layer-by-layer inference, model weights are loaded and evicted per layer, but KV caches must persist across the entire forward pass AND across generation steps. KV caches are small relative to model weights: per layer they grow with seq_len × num_kv_heads × head_dim, not with parameter count.
airllm's approach:

```python
# Initialize: one (k_cache, v_cache) pair per module in the layer list
kv_cache_list = [([], []) for _ in self.layers]

# During generation, each layer reads and writes its own cache
k_cache, v_cache = past_key_values[layer_idx]
# The layer computes attention over the cached keys/values plus the new token;
# the updated cache is returned for the next generation step.

# Trim non-transformer entries (embed, norm, lm_head)
kv_cache_list = kv_cache_list[1:-2]  # only transformer layers carry KV caches
```

On the RTX 4070 with 8GB VRAM, KV cache size is the main constraint on sequence length for large models, so explicit memory budgeting is essential.
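A minimal sketch of the layer-aligned structure this issue asks for, assuming PyTorch tensors shaped [batch, num_kv_heads, seq, head_dim]; the class name LayerAlignedKVCache and its methods are illustrative, not an existing API:

```python
import torch

class LayerAlignedKVCache:
    """One (k, v) pair per transformer layer, indexed by layer_idx."""

    def __init__(self, num_layers: int):
        # None until the layer caches its first tokens
        self.cache = [None] * num_layers

    def update(self, layer_idx, k_new, v_new):
        """Append this step's keys/values along the sequence dimension
        and return the full (k, v) tensors for attention."""
        entry = self.cache[layer_idx]
        if entry is None:
            self.cache[layer_idx] = (k_new, v_new)
        else:
            k_old, v_old = entry
            self.cache[layer_idx] = (
                torch.cat([k_old, k_new], dim=2),
                torch.cat([v_old, v_new], dim=2),
            )
        return self.cache[layer_idx]
```

Because the cache lives on the GPU independently of which layer's weights are currently resident, each layer can call update(layer_idx, k, v) right after its weights arrive, and the result survives the layer's eviction.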
Memory Budget Example (70B model, 8GB VRAM)
- Per-layer KV cache at seq_len=2048: ~8MB (8 KV heads × 128 head_dim × 2048 tokens × 2 (K+V) × 2 bytes fp16)
- 80 layers total: ~640MB for the full KV cache
- Remaining VRAM for weights: ~7GB (comfortably enough for one layer at a time; a 70B fp16 layer is roughly 1.75GB)
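These numbers fall out of a one-line formula, so the budget calculator called for below can be a small function. A sketch under the assumptions above (8 KV heads, head_dim 128, 80 layers, fp16); the function name max_kv_seq_len is hypothetical:

```python
def max_kv_seq_len(vram_budget_bytes: int,
                   num_layers: int = 80,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Largest sequence length whose full KV cache fits in the budget."""
    # Per token, every layer stores K and V, each num_kv_heads * head_dim wide.
    bytes_per_token = num_layers * num_kv_heads * head_dim * 2 * dtype_bytes
    return vram_budget_bytes // bytes_per_token

# Reserving 2 GiB of the 8GB card for KV cache:
print(max_kv_seq_len(2 * 1024**3))  # -> 6553 tokens
```

Note that the formula uses num_kv_heads, not the query head count; that is exactly how GQA models (e.g. 8 KV heads vs. 64 query heads on Llama-style 70B models) keep the cache small enough to fit on an 8GB card.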
Key Files
- autobot-backend/llm_interface_pkg/optimization/kv_cache.py — new
- autobot-backend/llm_interface_pkg/providers/ — integrate into generation loop
- autobot-backend/llm_interface_pkg/optimization/memory_manager.py — VRAM budgeting
Acceptance Criteria
- Layer-aligned KV cache with per-layer (k, v) pairs
- KV cache reuse across autoregressive generation steps
- VRAM budget calculator: max sequence length given model + VRAM
- Cache stored on GPU (small enough to coexist with one layer's weights)
- Trim non-transformer entries from cache list
- Support for GQA models (fewer KV heads than Q heads)
- Cache eviction strategy when approaching the VRAM limit (see the sliding-window sketch after this list)
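For the eviction criterion, one simple strategy is a sliding window that drops the oldest cached tokens in every layer once the cache nears its budget. This is an assumption about how the criterion could be met, not a mandated design; windowed eviction changes attention results, since evicted tokens are forgotten:

```python
import torch

def evict_to_window(cache, window: int):
    """Keep only the most recent `window` tokens in each layer's (k, v) pair.

    cache: list of (k, v) tensors shaped [batch, num_kv_heads, seq, head_dim],
    with None for layers that have not cached anything yet.
    """
    for i, entry in enumerate(cache):
        if entry is None:
            continue
        k, v = entry
        if k.size(2) > window:
            # Slicing yields a view into the old storage; .contiguous() copies
            # into a fresh tensor so the larger backing buffer can be freed.
            cache[i] = (k[:, :, -window:].contiguous(),
                        v[:, :, -window:].contiguous())
    # Release now-unused cached blocks held by PyTorch's allocator.
    torch.cuda.empty_cache()
```

This would pair naturally with the centralized cleanup utility from #1942 between generation steps.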
Related Issues
- feat(optimization): layer-by-layer inference mode for batch/offline processing #1946 (layer-by-layer inference — KV cache is essential for generation)
- feat(optimization): centralized GPU memory cleanup utility (airllm pattern) #1942 (memory cleanup — clean between generation steps if needed)