Summary
Inference calls are costly and slow for repeated or near-duplicate prompts.
Current behavior only benefits from exact-match caching.
This issue adds semantic caching so similar prompts can reuse previous responses.
Why This Matters
- Reduce average response latency
- Lower inference/API cost
- Improve throughput under repetitive traffic
Scope
- Add semantic cache middleware in gateway
- Reuse existing embedding adapter from foundation
- Store prompt embeddings and responses in vector cache
- Lookup by similarity before LLM execution
- Insert successful responses after execution
- Make feature configurable and disabled by default
Proposed Design
- On incoming chat prompt:
  - Embed the prompt using the existing embedding adapter
  - Search the semantic cache with a configurable threshold and top_k
  - If hit, return the cached response immediately
- On cache miss:
  - Execute the normal agent flow
  - Store the prompt embedding + output for reuse
- Isolation:
  - Restrict cache hits by agent_id to avoid cross-agent leakage
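The lookup/insert flow above could be sketched roughly as follows. This is an illustrative in-memory shape only, not the actual middleware API: `SemanticCache`, `CacheEntry`, and cosine similarity as the distance metric are all assumptions.

```rust
// Illustrative in-memory semantic cache keyed by (agent_id, embedding).
// Type and field names are hypothetical, not the gateway's real API.
struct CacheEntry {
    agent_id: String,
    embedding: Vec<f32>,
    response: String,
}

struct SemanticCache {
    entries: Vec<CacheEntry>,
    threshold: f32, // e.g. 0.95, per the proposed config
}

// Cosine similarity is one common choice of metric; the real adapter
// may use a different distance function.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl SemanticCache {
    // Returns the best response above the threshold, restricted to the
    // caller's agent_id to avoid cross-agent leakage.
    fn lookup(&self, agent_id: &str, embedding: &[f32]) -> Option<&str> {
        self.entries
            .iter()
            .filter(|e| e.agent_id == agent_id) // agent isolation
            .map(|e| (cosine_similarity(&e.embedding, embedding), e))
            .filter(|(sim, _)| *sim >= self.threshold)
            .max_by(|(a, _), (b, _)| a.partial_cmp(b).unwrap())
            .map(|(_, e)| e.response.as_str())
    }

    // Called after a successful agent execution (the cache-miss path).
    fn insert(&mut self, agent_id: String, embedding: Vec<f32>, response: String) {
        self.entries.push(CacheEntry { agent_id, embedding, response });
    }
}
```

In this sketch, isolation is enforced inside `lookup` rather than by partitioning storage; either approach satisfies the agent_id boundary, but a partitioned store would also bound scan cost per agent.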
Configuration
- semantic_cache_enabled: bool
- semantic_cache_threshold: f32 (example: 0.95)
- semantic_cache_top_k: usize
- semantic_cache_embedding_provider: openai | ollama
- semantic_cache_embedding_model: optional string
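One possible shape for these settings, with the feature disabled by default as required by the scope. The struct name, the `top_k` default of 1, and the provider default are assumptions for illustration:

```rust
// Hypothetical config struct mirroring the proposed keys.
#[derive(Debug, Clone)]
enum EmbeddingProvider {
    OpenAi,
    Ollama,
}

#[derive(Debug, Clone)]
struct SemanticCacheConfig {
    enabled: bool,
    threshold: f32,
    top_k: usize,
    embedding_provider: EmbeddingProvider,
    embedding_model: Option<String>, // None => provider's default model
}

impl Default for SemanticCacheConfig {
    fn default() -> Self {
        Self {
            enabled: false, // disabled by default, per the scope
            threshold: 0.95, // example value from this proposal
            top_k: 1,        // assumed default; only the best match is considered
            embedding_provider: EmbeddingProvider::OpenAi, // assumed default
            embedding_model: None,
        }
    }
}
```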
Acceptance Criteria
- Semantic cache middleware exists and is wired in gateway chat path
- Cache lookup occurs before agent execution
- Cache insert occurs after successful execution
- Chat response includes cache metadata fields
- Unit tests cover:
  - semantic hit
  - agent boundary isolation
  - disabled mode
- Gateway builds successfully
Risks
- False positives (semantically different prompts treated as equivalent) if the threshold is too low
- In-memory store is process-local and non-persistent
- Embedding provider failures need clear error handling
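For the last risk, one reasonable policy is to "fail open": if the embedding call errors, log it and treat the request as a cache miss so the chat request still succeeds. The helper below is a sketch of that policy; `embed` stands in for the real embedding adapter call and its signature is assumed.

```rust
// Degrade gracefully on embedding-provider failure: log and fall through
// to the normal agent flow instead of propagating the error.
// `embed` is a hypothetical stand-in for the embedding adapter.
fn embedding_or_miss<F>(embed: F, prompt: &str) -> Option<Vec<f32>>
where
    F: Fn(&str) -> Result<Vec<f32>, String>,
{
    match embed(prompt) {
        Ok(vector) => Some(vector),
        Err(e) => {
            // Surface the failure for observability, but do not fail the request.
            eprintln!("semantic cache skipped for this request: {e}");
            None
        }
    }
}
```

A `None` here means both lookup and insert are skipped for that request, which keeps provider outages invisible to end users at the cost of a temporarily lower hit rate.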
Follow-ups
- Add persistent backend option (Qdrant)
- Add TTL/eviction policy
- Add cache hit-rate metrics