Add Semantic Cache for Gateway Chat Inference #1611

@FireFistisDead

Summary

Inference calls are costly and slow for repeated or near-duplicate prompts.
Current behavior only benefits from exact-match caching.
This issue adds semantic caching so similar prompts can reuse previous responses.

Why This Matters

  • Reduce average response latency
  • Lower inference/API cost
  • Improve throughput under repetitive traffic

Scope

  • Add semantic cache middleware in gateway
  • Reuse existing embedding adapter from foundation
  • Store prompt embeddings and responses in vector cache
  • Lookup by similarity before LLM execution
  • Insert successful responses after execution
  • Make feature configurable and disabled by default

Proposed Design

  1. On incoming chat prompt:
    • Embed prompt using existing embedding adapter
    • Search semantic cache with configurable threshold and top_k
    • If hit, return cached response immediately
  2. On cache miss:
    • Execute normal agent flow
    • Store prompt embedding + output for reuse
  3. Isolation:
    • Restrict cache hits by agent_id to avoid cross-agent leakage
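A minimal sketch of the lookup and insert steps, assuming an in-memory store and cosine similarity as the metric (the issue does not name one). `SemanticCache`, `CacheEntry`, and `cosine_similarity` are hypothetical names for illustration, not existing gateway types; the embedding is taken as a plain `Vec<f32>` already produced by the embedding adapter.

```rust
/// One cached exchange: owning agent, prompt embedding, and the response text.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone)]
pub struct CacheEntry {
    pub agent_id: String,
    pub embedding: Vec<f32>,
    pub response: String,
}

/// Hypothetical in-memory store: process-local and non-persistent.
#[derive(Debug, Default)]
pub struct SemanticCache {
    entries: Vec<CacheEntry>,
}

/// Cosine similarity (assumed metric). Returns 0.0 for empty, zero-norm,
/// or length-mismatched vectors so they can never count as a hit.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

impl SemanticCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Store a prompt embedding and its response after a successful execution.
    pub fn insert(&mut self, agent_id: &str, embedding: Vec<f32>, response: String) {
        self.entries.push(CacheEntry {
            agent_id: agent_id.to_string(),
            embedding,
            response,
        });
    }

    /// Return up to `top_k` (similarity, response) pairs for this agent whose
    /// similarity is at or above `threshold`, best match first.
    pub fn lookup(
        &self,
        agent_id: &str,
        query: &[f32],
        threshold: f32,
        top_k: usize,
    ) -> Vec<(f32, &str)> {
        let mut hits: Vec<(f32, &str)> = self
            .entries
            .iter()
            // Isolation: entries from other agents are never considered.
            .filter(|e| e.agent_id == agent_id)
            .map(|e| (cosine_similarity(&e.embedding, query), e.response.as_str()))
            .filter(|(score, _)| *score >= threshold)
            .collect();
        hits.sort_by(|a, b| b.0.total_cmp(&a.0));
        hits.truncate(top_k);
        hits
    }
}
```

The linear scan is only reasonable for a small in-memory cache; a vector index or an external backend would replace it at scale. Filtering on `agent_id` before scoring means a prompt from one agent can never match another agent's entry, no matter how similar the embeddings are.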

Configuration

  • semantic_cache_enabled: bool
  • semantic_cache_threshold: f32 (example: 0.95)
  • semantic_cache_top_k: usize
  • semantic_cache_embedding_provider: openai | ollama
  • semantic_cache_embedding_model: optional string
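One way these settings could be modelled, as a sketch rather than the gateway's actual config type. The field names follow the list above; `SemanticCacheConfig`, `EmbeddingProvider`, `validate`, the default `top_k` of 1, the default provider, and the `[0.0, 1.0]` threshold range are assumptions made for illustration.

```rust
/// Embedding providers named in the issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EmbeddingProvider {
    OpenAi,
    Ollama,
}

/// Hypothetical config struct mirroring the proposed settings.
#[derive(Debug, Clone, PartialEq)]
pub struct SemanticCacheConfig {
    pub semantic_cache_enabled: bool,
    pub semantic_cache_threshold: f32,
    pub semantic_cache_top_k: usize,
    pub semantic_cache_embedding_provider: EmbeddingProvider,
    pub semantic_cache_embedding_model: Option<String>,
}

impl Default for SemanticCacheConfig {
    fn default() -> Self {
        Self {
            // The issue requires the feature to be disabled by default.
            semantic_cache_enabled: false,
            // 0.95 is the example value from the issue.
            semantic_cache_threshold: 0.95,
            // Assumed default: only the single best match is considered.
            semantic_cache_top_k: 1,
            // Assumed default provider; the issue does not pick one.
            semantic_cache_embedding_provider: EmbeddingProvider::OpenAi,
            semantic_cache_embedding_model: None,
        }
    }
}

impl SemanticCacheConfig {
    /// Reject values that would make lookups meaningless.
    /// The [0.0, 1.0] threshold range is an assumption, not part of the issue.
    pub fn validate(&self) -> Result<(), String> {
        if !(0.0..=1.0).contains(&self.semantic_cache_threshold) {
            return Err(format!(
                "semantic_cache_threshold must be within [0.0, 1.0], got {}",
                self.semantic_cache_threshold
            ));
        }
        if self.semantic_cache_top_k == 0 {
            return Err("semantic_cache_top_k must be at least 1".to_string());
        }
        Ok(())
    }
}
```

Validating at startup surfaces a mistyped threshold early instead of as a silent change in hit rate. A `NaN` threshold also fails the range check.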

Acceptance Criteria

  • Semantic cache middleware exists and is wired in gateway chat path
  • Cache lookup occurs before agent execution
  • Cache insert occurs after successful execution
  • Chat response includes cache metadata fields
  • Unit tests cover:
    • semantic hit
    • agent boundary isolation
    • disabled mode
  • Gateway builds successfully
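The ordering these criteria describe (lookup, then execute, then insert) can be sketched as a single function. This is a synchronous illustration under stated assumptions: `handle_chat`, `ChatResponse`, and the metadata fields `cache_hit` and `cache_similarity` are hypothetical, since the issue does not name the metadata fields. The closures stand in for the real cache lookup, agent execution, and cache insert. Treating a failed lookup as a miss is one possible policy, not something the issue decides.

```rust
/// Hypothetical chat response carrying cache metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct ChatResponse {
    pub content: String,
    /// True when the content came from the semantic cache.
    pub cache_hit: bool,
    /// Similarity score of the matched entry, if any.
    pub cache_similarity: Option<f32>,
}

/// Lookup before execution, insert after successful execution.
/// `lookup`, `execute`, and `store` are stand-ins for gateway components.
pub fn handle_chat<L, E, S>(
    enabled: bool,
    prompt: &str,
    lookup: L,
    execute: E,
    store: S,
) -> Result<ChatResponse, String>
where
    L: FnOnce(&str) -> Result<Option<(f32, String)>, String>,
    E: FnOnce(&str) -> Result<String, String>,
    S: FnOnce(&str, &str),
{
    if enabled {
        // Assumed policy: a lookup error (for example an embedding provider
        // failure) is treated as a miss so the request can still be served.
        if let Ok(Some((similarity, content))) = lookup(prompt) {
            return Ok(ChatResponse {
                content,
                cache_hit: true,
                cache_similarity: Some(similarity),
            });
        }
    }

    // Cache miss or disabled mode: run the normal agent flow.
    // A failed execution returns early and is never stored.
    let content = execute(prompt)?;

    if enabled {
        store(prompt, &content);
    }

    Ok(ChatResponse {
        content,
        cache_hit: false,
        cache_similarity: None,
    })
}
```

Because the dependencies are passed in as closures, the semantic-hit and disabled-mode cases can be unit tested without a real embedding provider or LLM.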

Risks

  • False positives if threshold is too low
  • In-memory store is process-local and non-persistent
  • Embedding provider failures need clear error handling

Follow-ups

  • Add persistent backend option (Qdrant)
  • Add TTL/eviction policy
  • Add cache hit-rate metrics
