A complete, runnable, multi-agent RAG application demonstrating the production patterns senior GenAI engineers ship: hybrid retrieval, agentic tool use, human-in-the-loop, multi-turn state with persistence, structured outputs, multi-agent orchestration, two-tier caching, structured event streaming, and a measurement harness on top of all of it.
Built incrementally across 20 focused steps β each one introduces one concept, produces something runnable, and ties back to senior-level interview talking points.
POST /agent {thread_id, question}
β
βΌ
ββββββββββββββββ
β supervisor β classifies intent β typed RouteDecision
ββββββββ¬ββββββββ
β
ββββββββββββββββββββΌβββββββββββββββββββ
βΌ βΌ βΌ
ββββββββββββββββββ ββββββββββββββββ βββββββββββββββββ
β policy_agent β β ops_agent β β general_agent β
β (RAG) β β (calc + β β (LLM-only β
β β β refund) β β fallback) β
β ββββββββββββ β β β β β
β β Redis β β β ToolNode β β β
β β cache β β β + HITL on β β β
β β (2-tier) β β β risky tools β β β
β ββββββ¬ββββββ β ββββββββ¬ββββββββ βββββββββ¬ββββββββ
β β miss β β β
β βΌ β βΌ βΌ
β ββββββββββββ β tool execution simple LLM
β β Hybrid β β (interrupts on response
β β retrievalβ β submit_refund_ (general
β β BM25+ β β request) knowledge,
β β dense β β smalltalk)
β β + rerank β β
β ββββββ¬ββββββ β
β βΌ β
β Structured β
β RagAnswer β
β (Pydantic) β
βββββββββ¬βββββββββ
β
ββββββββββββββββββββ¬βββββββββββββββββββ
βΌ
βββββββββββββββββββββ
β /agent/stream β β SSE structured events
β /agent β β JSON response
βββββββββββ¬ββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββ
β Cross-cutting: β
β β’ Postgres (LangGraph checkpoints) β
β β’ Redis (cache) β
β β’ LangSmith (tracing) β
β β’ RAGAS (eval harness, regression detect) β
β β’ AWS Bedrock (Claude Sonnet 4.5) β
β β’ Ollama fallback (Gemma 4) via env flip β
ββββββββββββββββββββββββββββββββββββββββββββββ
Three specialist agents, one supervisor, one knowledge base, one cache, one eval harness, one streaming endpoint. Built with the patterns you'd actually defend in a senior interview.
| Layer | Component | Step |
|---|---|---|
| LLM | Claude Sonnet 4.5 via Bedrock (Gemma 4 fallback via Ollama) | 0, 1, 15 |
| Embeddings | Amazon Titan v2 (nomic-embed-text fallback) | 2, 15 |
| Vector DB | Chroma with metadata filtering | 2, 3 |
| Retrieval | Hybrid BM25 + dense + Cohere/BGE reranker | 11 |
| Generation | Structured RagAnswer via tool-binding + forced tool choice |
4, 17 |
| Agent runtime | LangGraph stateful graphs, three subgraph specialists | 7, 18 |
| Tool use | Calculator, KB-search, refund submission with selective HITL | 6, 14 |
| Routing | LLM-based intent classification, typed RouteDecision |
18 |
| State | Postgres checkpoints, multi-turn conversation memory | 13 |
| Compression | Auto-summarization of older turns past token threshold | 16 |
| Caching | Two-tier Redis cache (exact + semantic, per-specialist policies) | 19 |
| Streaming | Server-Sent Events with structured event types | 20 |
| Observability | LangSmith tracing with tags, metadata, per-step spans | 10 |
| Evaluation | RAGAS harness, golden dataset, regression detection script | 12, 18 |
| API | FastAPI with Pydantic validation, lifespan pre-warming | 5, 8 |
# 1. Prerequisites
brew install postgresql@16 redis # or run via Docker
docker run -d --name pg -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=genai -p 5432:5432 postgres:16
docker run -d --name redis -p 6379:6379 redis:7-alpine
# 2. AWS Bedrock setup (one-time)
aws configure # standard credentials
# In Bedrock console β Model access β request Claude Sonnet 4.5 + Titan v2
# 3. Environment
cp .env.example .env
# Set: AWS_REGION, BEDROCK_LLM_MODEL, BEDROCK_EMBED_MODEL,
# POSTGRES_DSN, REDIS_URL, LANGCHAIN_API_KEY, COHERE_API_KEY
# 4. Python
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# 5. Index the knowledge base
python scripts/ingest.py
# 6. Run
uvicorn app.main:app --reload --port 8000Then in another terminal:
# Single-turn agent invocation
curl -X POST http://localhost:8000/agent \
-H "Content-Type: application/json" \
-d '{"thread_id": "demo", "question": "What is the refund window?"}'
# Streaming with structured events
curl -N -X POST http://localhost:8000/agent/stream \
-H "Content-Type: application/json" \
-d '{"thread_id": "demo2", "question": "Compare SmartHub Pro and Lite"}'
# Browser demo for streaming UX
open http://localhost:8000/static/stream_demo.html
# Run the eval harness
python scripts/run_eval.py --tag baseline
python scripts/eval_check.py # CI-style regression gateFor the fallback path (no AWS, runs entirely on your laptop):
LLM_PROVIDER=ollama uvicorn app.main:app --reloadOllama with Gemma 4 + nomic-embed-text + Chroma still works for everything except the cloud-specific bits.
Step ordering is deliberate. Each one is the right next problem given what came before:
FOUNDATIONS (0-8)
Setup β bare LLM β vector store β ingest β naive RAG
β FastAPI β tools β agent graph β endpoint
β
OPERATIONAL DEPTH (10-12)
β Observability before refactoring (you'll need traces to verify
β the next changes worked)
β Hybrid + reranker (the production retrieval pattern)
β RAGAS eval (now changes have a measurable answer)
β
STATE & SAFETY (13-14)
β Postgres checkpoints + multi-turn (real chatbot UX)
β Selective HITL (safe to ship destructive tools)
β
PRODUCTION POSTURE (15-17)
β Bedrock migration (production-grade LLM/embeddings)
β Conversation summarization (bounds token cost)
β Structured outputs (typed contracts on boundaries)
β
ARCHITECTURE & UX (18-20)
β Multi-agent supervisor (capability scales without prompt bloat)
β Two-tier caching (cost + latency win)
β Streaming with structured events (production UX)
This isn't the order most tutorials teach, but it's the order a senior engineer would actually build it. Observability before optimization. Eval before refactor. Foundations stable before adding capability. Production posture before deployment.
| # | Step | What it adds | Key concept |
|---|---|---|---|
| 0 | Setup | Ollama + project skeleton + config | Tooling foundation |
| 1 | LLM baseline | ChatOllama, prompt templates |
Provider abstraction |
| 2 | Vector store | Chroma + embeddings + metadata filter | Semantic search |
| 3 | Ingestion | Recursive chunking, metadata tagging | Chunking strategy |
| 4 | Naive RAG | LCEL retrieve β prompt β generate | The RAG primitive |
| 5 | FastAPI surface | Endpoints, Pydantic validation, lifespan | Service shape |
| 6 | Tools | @tool, schema generation, RAG-as-tool |
Function-calling |
| 7 | LangGraph agent | StateGraph, conditional edges, ReAct loop |
Stateful graphs |
| 8 | Agent endpoint | End-to-end three-tier API | Composition |
| 10 | Observability | LangSmith traces with tags + metadata | Distributed tracing for LLMs |
| 11 | Hybrid + rerank | BM25 + dense ensemble + Cohere/BGE | Production retrieval |
| 12 | Eval harness | RAGAS metrics, golden dataset, regression script | Measurement |
| 13 | Checkpoints | Postgres-backed multi-turn state | Stateful agents |
| 14 | HITL | Selective interrupt() on risky tools, approve/reject/edit |
Production safety |
| 15 | Bedrock | Provider switching (Bedrock β Ollama via env var) | Multi-provider posture |
| 16 | Summarization | Token-budget-triggered compression with RemoveMessage |
Custom reducers |
| 17 | Structured outputs | Manual tool-binding, forced tool choice, Pydantic boundary | Typed contracts |
| 18 | Supervisor | Three specialist subgraphs + LLM-routed dispatch | Multi-agent architecture |
| 19 | Caching | Two-tier Redis (exact + semantic), per-specialist policies | Cost/latency wins |
| 20 | Streaming | SSE with structured event types | Production UX |
(Step 9 is the original "where to next" roadmap; steps 10-20 implement extensions from it.)
After working through this, you can articulate:
- "How do you debug LLM apps?" β distributed tracing via LangSmith; tags + metadata for filtering; trace-based eval (Step 10)
- "How do you measure RAG quality?" β RAGAS harness with faithfulness, relevancy, context-precision, context-recall; golden dataset curation; regression-blocking CI gate (Step 12)
- "What's your retrieval setup?" β hybrid BM25+dense via
EnsembleRetriever, Cohere reranker on top, measurable improvement vs. dense-only (Step 11) - "How do you handle long conversations?" β Postgres checkpoints by
thread_id; auto-summarization past 3K-token threshold usingRemoveMessagereducer (Steps 13, 16) - "How do you make destructive actions safe?" β selective
interrupt()onRISKY_TOOLS; approve/reject/edit semantics; durable pause via checkpoints (Step 14) - "How do you scale capability without prompt bloat?" β multi-agent supervisor with subgraph specialists; typed
RouteDecisioncontract; per-specialist tool lists and prompts (Step 18) - "How do you cut LLM costs?" β two-tier cache (exact + semantic) with per-specialist TTL policies; threshold tuning against eval set; cache-invalidation strategy (Step 19)
- "How do you ship streaming?" β SSE with structured event taxonomy (
token,tool_start,cache_hit, etc.); tag-based filtering for selective streaming;X-Accel-Buffering: nofor proxy compatibility; cancellation handling (Step 20) - "Why these specific tools/models?" β multi-provider via env-var switch; Bedrock for prod (managed, IAM, regional); Ollama for offline dev; tradeoffs measurable in same eval harness (Step 15)
Each answer is backed by code in this repo and traces in LangSmith. That's the difference between "I read about agents" and "I've built and operated agents."
genai-rag-app/
βββ app/
βββ main.py β FastAPI surface
β βββ agents/
β β βββ policy.py β RAG specialist subgraph
β β βββ ops.py β tool-using specialist subgraph
β β βββ general.py β LLM-only fallback
β βββ agent_graph.py β parent graph: supervisor + specialists
β βββ agent_graph_helpers.py β shared HITL/tools logic
β βββ supervisor.py β intent classification node
β βββ cache.py β two-tier Redis cache
β βββ checkpointer.py β Postgres checkpointer + connection pool
β βββ streaming.py β SSE event-translation layer
β βββ eval.py β RAGAS harness
β βββ llm.py β provider-switching factories
β βββ retrieval.py β hybrid + rerank pipeline
β βββ rag_chain.py β structured RAG via tool binding
β βββ tools.py β @tool definitions
β βββ schemas.py β Pydantic types (RagAnswer, RouteDecision)
β βββ vector_store.py β Chroma wrapper
β βββ config.py β single source of truth for tunables
βββ routes.py β FastAPI routes
βββ data/
β βββ docs/ β knowledge base source files
β βββ chroma/ β vector index (generated)
β βββ eval/
β βββ golden.jsonl β evaluation dataset
β βββ results/ β per-run CSVs + aggregate trend
βββ scripts/
β βββ ingest.py β chunk + embed + index
β βββ run_eval.py β run RAGAS, write CSV
β βββ eval_check.py β regression detection (CI-style)
β βββ NN_*_smoke.py β per-step smoke tests
βββ static/
β βββ stream_demo.html β browser SSE demo
βββ steps/
βββ 00-20.md β the build guide
This system is feature-complete from a portfolio standpoint, but production readiness needs more. The honest list:
- Authentication / multi-tenancy.
thread_idshould be derived from authenticated identity, not trusted from the client. Skipped for teaching clarity; called out in Step 13's interview talking points. - Per-tenant data isolation. Production multi-tenant RAG needs per-tenant collections (or at minimum, per-tenant metadata filtering on retrieval). Not built here.
- Deployment. No Docker compose, no Lambda/SAM, no Terraform. The architecture is deploy-ready (stateless API, Postgres + Redis externalized, AWS for inference); turning it into a deployment is its own project.
- Cost-aware routing. The supervisor uses Sonnet for every classification. Production would route trivially-classifiable queries with a cheaper model (Haiku, or even an embedding-similarity classifier).
- Conversation summarization at the parent-graph level. Step 16 wired summarization into the pre-supervisor agent graph; the Step 18 refactor to specialists left it unintegrated. Re-integration is a one-page extension.
- Production-grade vector search. Chroma is fine to ~100K chunks. Past that, Qdrant, Pinecone, or pgvector with HNSW are the next step. The retriever interface is the swap boundary; one file changes.
- Audit trail beyond LangSmith. HITL approve/reject events should also write to a structured audit log for compliance contexts. Not built here.
- Fine-tuning. Out of scope; the agent uses prompting + tool design rather than custom-trained models. For most use cases, that's the right call.
These are the conversations the system gets you to, not gaps that invalidate it.
Reading order: README β Step 0 β work through sequentially. Don't skip steps. Each one assumes the previous one is working; running into a problem from Step N+2 because you skipped N is a frustrating way to spend an evening.
Per-step structure (every step follows the same template):
- Goal β what you're building, in one sentence
- What you're building β diagram + file list, before any code
- Mental model β why this is the next right thing
- Code sections β typed in, not just pasted; the typing helps retention
- Verification β concrete test that proves it works
- Senior framing β interview talking points specific to that step
- Troubleshooting β every gotcha I hit, with the fix
If a step takes you longer than the time estimate, you're probably learning more than the estimate accounted for. That's fine. The point is the depth, not the speed.
The implementation choices that distinguish this from "another RAG tutorial":
- Manual tool binding for structured outputs (Step 17) instead of
with_structured_output()β teaches the underlying mechanism so you can drop down when the high-level API breaks. - Subgraphs for specialists (Step 18) instead of node functions β independently testable, deployable, and replaceable; mirrors the monolith-to-services architectural shift.
- Selective HITL (Step 14) instead of blanket
interrupt_beforeβ the production pattern; pause only on tools with side effects. - Custom
RemoveMessage-based summarization (Step 16) instead of "just truncate the list" β shows the LangGraph reducer model under the hood. - Two-tier cache with exact-then-semantic (Step 19) instead of one-tier or LangChain's built-in cache β demonstrates the production hierarchy and threshold-tuning discussion.
- Tag-based event filtering for streaming (Step 20) instead of streaming everything β prevents leaking supervisor reasoning to users.
- Eval harness with regression-detection script (Step 12) instead of one-shot evals β the CI-gate pattern.
Each is a small choice that signals familiarity with what production looks like, not just what tutorials show.
Built and tested against:
- Python 3.13
- LangChain, LangGraph, langchain-aws
- Postgres 16, Redis 7
- Claude Sonnet 4.5 via Bedrock (
us.anthropic.claude-sonnet-4-5-20250929-v1:0) - Amazon Titan v2 embeddings
If you find a step that's drifted out of date, the troubleshooting section in each step is the first place to check.