Summary
Inference calls are costly and slow for repeated or near-duplicate prompts.
Current behavior only benefits from exact-match caching.
This issue adds semantic caching so similar prompts can reuse previous responses.
Why This Matters
- Reduce average response latency
- Lower inference/API cost
- Improve throughput under repetitive traffic
Scope
- Add semantic cache middleware in gateway
- Reuse existing embedding adapter from foundation
- Store prompt embeddings and responses in vector cache
- Lookup by similarity before LLM execution
- Insert successful responses after execution
- Make feature configurable and disabled by default
Proposed Design
- On incoming chat prompt:
  - Embed the prompt using the existing embedding adapter
  - Search the semantic cache with a configurable threshold and top_k
  - If hit, return the cached response immediately
- On cache miss:
  - Execute the normal agent flow
  - Store the prompt embedding + output for reuse
- Isolation:
  - Restrict cache hits by agent_id to avoid cross-agent leakage
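The lookup/insert flow above could be sketched roughly as follows. This is an illustrative in-memory shape only, not the actual middleware API: `SemanticCache`, `CacheEntry`, and cosine similarity as the distance metric are all assumptions.

```rust
// Illustrative in-memory semantic cache keyed by (agent_id, embedding).
// Type and field names are hypothetical, not the gateway's real API.
struct CacheEntry {
    agent_id: String,
    embedding: Vec<f32>,
    response: String,
}

struct SemanticCache {
    entries: Vec<CacheEntry>,
    threshold: f32, // e.g. 0.95, per the proposed config
}

// Cosine similarity is one common choice of metric; the real adapter
// may use a different distance function.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl SemanticCache {
    // Returns the best response above the threshold, restricted to the
    // caller's agent_id to avoid cross-agent leakage.
    fn lookup(&self, agent_id: &str, embedding: &[f32]) -> Option<&str> {
        self.entries
            .iter()
            .filter(|e| e.agent_id == agent_id) // agent isolation
            .map(|e| (cosine_similarity(&e.embedding, embedding), e))
            .filter(|(sim, _)| *sim >= self.threshold)
            .max_by(|(a, _), (b, _)| a.partial_cmp(b).unwrap())
            .map(|(_, e)| e.response.as_str())
    }

    // Called after a successful agent execution (the cache-miss path).
    fn insert(&mut self, agent_id: String, embedding: Vec<f32>, response: String) {
        self.entries.push(CacheEntry { agent_id, embedding, response });
    }
}
```

In this sketch, isolation is enforced inside `lookup` rather than by partitioning storage; either approach satisfies the agent_id boundary, but a partitioned store would also bound scan cost per agent.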
Configuration
- semantic_cache_enabled: bool
- semantic_cache_threshold: f32 (example: 0.95)
- semantic_cache_top_k: usize
- semantic_cache_embedding_provider: openai | ollama
- semantic_cache_embedding_model: optional string
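One possible shape for these settings, with the feature disabled by default as required by the scope. The struct name, the `top_k` default of 1, and the provider default are assumptions for illustration:

```rust
// Hypothetical config struct mirroring the proposed keys.
#[derive(Debug, Clone)]
enum EmbeddingProvider {
    OpenAi,
    Ollama,
}

#[derive(Debug, Clone)]
struct SemanticCacheConfig {
    enabled: bool,
    threshold: f32,
    top_k: usize,
    embedding_provider: EmbeddingProvider,
    embedding_model: Option<String>, // None => provider's default model
}

impl Default for SemanticCacheConfig {
    fn default() -> Self {
        Self {
            enabled: false, // disabled by default, per the scope
            threshold: 0.95, // example value from this proposal
            top_k: 1,        // assumed default; only the best match is considered
            embedding_provider: EmbeddingProvider::OpenAi, // assumed default
            embedding_model: None,
        }
    }
}
```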
Acceptance Criteria
- Semantic cache middleware exists and is wired in gateway chat path
- Cache lookup occurs before agent execution
- Cache insert occurs after successful execution
- Chat response includes cache metadata fields
- Unit tests cover:
  - semantic hit
  - agent boundary isolation
  - disabled mode
- Gateway builds successfully
Risks
- False positives (semantically different prompts treated as equivalent) if the threshold is too low
- In-memory store is process-local and non-persistent
- Embedding provider failures need clear error handling
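For the last risk, one reasonable policy is to "fail open": if the embedding call errors, log it and treat the request as a cache miss so the chat request still succeeds. The helper below is a sketch of that policy; `embed` stands in for the real embedding adapter call and its signature is assumed.

```rust
// Degrade gracefully on embedding-provider failure: log and fall through
// to the normal agent flow instead of propagating the error.
// `embed` is a hypothetical stand-in for the embedding adapter.
fn embedding_or_miss<F>(embed: F, prompt: &str) -> Option<Vec<f32>>
where
    F: Fn(&str) -> Result<Vec<f32>, String>,
{
    match embed(prompt) {
        Ok(vector) => Some(vector),
        Err(e) => {
            // Surface the failure for observability, but do not fail the request.
            eprintln!("semantic cache skipped for this request: {e}");
            None
        }
    }
}
```

A `None` here means both lookup and insert are skipped for that request, which keeps provider outages invisible to end users at the cost of a temporarily lower hit rate.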
Follow-ups
- Add persistent backend option (Qdrant)
- Add TTL/eviction policy
- Add cache hit-rate metrics