Production-Shaped GenAI: A Multi-Agent System Built in 20 Steps

A complete, runnable, multi-agent RAG application demonstrating the production patterns senior GenAI engineers ship: hybrid retrieval, agentic tool use, human-in-the-loop, multi-turn state with persistence, structured outputs, multi-agent orchestration, two-tier caching, structured event streaming, and a measurement harness on top of all of it.

Built incrementally across 20 focused steps — each one introduces one concept, produces something runnable, and ties back to senior-level interview talking points.

What this system actually does

                       POST /agent {thread_id, question}
                                    │
                                    ▼
                            ┌──────────────┐
                            │  supervisor  │  classifies intent → typed RouteDecision
                            └──────┬───────┘
                                   │
                ┌──────────────────┼──────────────────┐
                ▼                  ▼                  ▼
       ┌────────────────┐  ┌──────────────┐  ┌───────────────┐
       │  policy_agent  │  │  ops_agent   │  │ general_agent │
       │  (RAG)         │  │  (calc +     │  │ (LLM-only     │
       │                │  │   refund)    │  │  fallback)    │
       │  ┌──────────┐  │  │              │  │               │
       │  │ Redis    │  │  │  ToolNode    │  │               │
       │  │ cache    │  │  │  + HITL on   │  │               │
       │  │ (2-tier) │  │  │  risky tools │  │               │
       │  └────┬─────┘  │  └──────┬───────┘  └───────┬───────┘
       │       │ miss   │         │                   │
       │       ▼        │         ▼                   ▼
       │  ┌──────────┐  │   tool execution      simple LLM
       │  │ Hybrid   │  │   (interrupts on        response
       │  │ retrieval│  │    submit_refund_       (general
       │  │ BM25+    │  │    request)            knowledge,
       │  │ dense    │  │                         smalltalk)
       │  │ + rerank │  │
       │  └────┬─────┘  │
       │       ▼        │
       │  Structured    │
       │  RagAnswer     │
       │  (Pydantic)    │
       └───────┬────────┘
               │
               └──────────────────┬──────────────────┘
                                  ▼
                         ┌───────────────────┐
                         │   /agent/stream   │  ← SSE structured events
                         │   /agent          │  ← JSON response
                         └─────────┬─────────┘
                                   │
                                   ▼
            ┌────────────────────────────────────────────┐
            │  Cross-cutting:                            │
            │  • Postgres (LangGraph checkpoints)        │
            │  • Redis (cache)                           │
            │  • LangSmith (tracing)                     │
            │  • RAGAS (eval harness, regression detect) │
            │  • AWS Bedrock (Claude Sonnet 4.5)         │
            │  • Ollama fallback (Gemma 4) via env flip  │
            └────────────────────────────────────────────┘

Three specialist agents, one supervisor, one knowledge base, one cache, one eval harness, one streaming endpoint. Built with the patterns you'd actually defend in a senior interview.
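
To make the supervisor's "typed RouteDecision" concrete, here is a minimal sketch of the routing contract. It is an illustration under assumptions, not the repo's code: the field names (`route`, `reason`) are hypothetical, the specialists are stub functions, and a stdlib dataclass stands in for the Pydantic model so the snippet runs without dependencies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Literal, get_args

# Route names mirror the three specialists in the diagram above.
Route = Literal["policy_agent", "ops_agent", "general_agent"]


@dataclass(frozen=True)
class RouteDecision:
    """Stdlib stand-in for the repo's Pydantic RouteDecision (fields assumed)."""

    route: Route
    reason: str

    def __post_init__(self) -> None:
        # Dataclasses do not enforce Literal types, so validate explicitly.
        if self.route not in get_args(Route):
            raise ValueError(f"unknown route: {self.route!r}")


# Stub specialists; in the real system each one is a LangGraph subgraph.
def policy_agent(question: str) -> str:
    return f"[policy] {question}"


def ops_agent(question: str) -> str:
    return f"[ops] {question}"


def general_agent(question: str) -> str:
    return f"[general] {question}"


SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "policy_agent": policy_agent,
    "ops_agent": ops_agent,
    "general_agent": general_agent,
}


def dispatch(decision: RouteDecision, question: str) -> str:
    """Send the question to the specialist named by the typed decision."""
    return SPECIALISTS[decision.route](question)
```

The point of the typed contract is that an unexpected route fails loudly at the boundary instead of silently falling through to the wrong specialist.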

Capabilities at a glance

Layer	Component	Step
LLM	Claude Sonnet 4.5 via Bedrock (Gemma 4 fallback via Ollama)	0, 1, 15
Embeddings	Amazon Titan v2 (nomic-embed-text fallback)	2, 15
Vector DB	Chroma with metadata filtering	2, 3
Retrieval	Hybrid BM25 + dense + Cohere/BGE reranker	11
Generation	Structured `RagAnswer` via tool-binding + forced tool choice	4, 17
Agent runtime	LangGraph stateful graphs, three subgraph specialists	7, 18
Tool use	Calculator, KB-search, refund submission with selective HITL	6, 14
Routing	LLM-based intent classification, typed `RouteDecision`	18
State	Postgres checkpoints, multi-turn conversation memory	13
Compression	Auto-summarization of older turns past token threshold	16
Caching	Two-tier Redis cache (exact + semantic, per-specialist policies)	19
Streaming	Server-Sent Events with structured event types	20
Observability	LangSmith tracing with tags, metadata, per-step spans	10
Evaluation	RAGAS harness, golden dataset, regression detection script	12, 18
API	FastAPI with Pydantic validation, lifespan pre-warming	5, 8
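
The "exact + semantic" cache in the Caching row can be sketched as follows. This is a simplified, in-memory illustration: plain Python containers stand in for Redis, the `embed` callable and the 0.92 similarity threshold are assumptions, and TTLs and per-specialist policies are left out.

```python
import hashlib
import math
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    """Tier 1: exact match on a normalized key. Tier 2: nearest stored embedding."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92) -> None:
        self.embed = embed            # hypothetical embedding function
        self.threshold = threshold    # assumed value; tune against the eval set
        self.exact: Dict[str, str] = {}
        self.semantic: List[Tuple[Vector, str]] = []

    @staticmethod
    def _key(question: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share one key.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, question: str) -> Optional[Tuple[str, str]]:
        """Return (tier, answer) on a hit, or None on a miss."""
        hit = self.exact.get(self._key(question))
        if hit is not None:
            return ("exact", hit)
        query_vec = self.embed(question)
        best_score, best_answer = 0.0, None
        for vec, answer in self.semantic:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return ("semantic", best_answer)
        return None

    def put(self, question: str, answer: str) -> None:
        self.exact[self._key(question)] = answer
        self.semantic.append((self.embed(question), answer))
```

Checking the exact tier first keeps the common case to a single hash lookup; the embedding call is only paid when that lookup misses.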

Quick start

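# 1. Prerequisites: Postgres 16 and Redis 7 (pick ONE option; both bind ports 5432 and 6379)
# Option A: native install via Homebrew
brew install postgresql@16 redis
# Option B: Docker
docker run -d --name pg -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=genai -p 5432:5432 postgres:16
docker run -d --name redis -p 6379:6379 redis:7-alpine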

# 2. AWS Bedrock setup (one-time)
aws configure                       # standard credentials
# In Bedrock console → Model access → request Claude Sonnet 4.5 + Titan v2

# 3. Environment
cp .env.example .env
# Set: AWS_REGION, BEDROCK_LLM_MODEL, BEDROCK_EMBED_MODEL,
#      POSTGRES_DSN, REDIS_URL, LANGCHAIN_API_KEY, COHERE_API_KEY

# 4. Python
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 5. Index the knowledge base
python scripts/ingest.py

# 6. Run
uvicorn app.main:app --reload --port 8000

Then in another terminal:

# Single-turn agent invocation
curl -X POST http://localhost:8000/agent \
  -H "Content-Type: application/json" \
  -d '{"thread_id": "demo", "question": "What is the refund window?"}'

# Streaming with structured events
curl -N -X POST http://localhost:8000/agent/stream \
  -H "Content-Type: application/json" \
  -d '{"thread_id": "demo2", "question": "Compare SmartHub Pro and Lite"}'

# Browser demo for streaming UX
open http://localhost:8000/static/stream_demo.html

# Run the eval harness
python scripts/run_eval.py --tag baseline
python scripts/eval_check.py        # CI-style regression gate
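
The idea behind a CI-style regression gate can be sketched in a few lines. This is not the contents of scripts/eval_check.py: the function names, the metric dictionaries, and the 0.02 tolerance are assumptions chosen for illustration.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,  # assumed allowance for run-to-run noise
) -> List[str]:
    """Describe every metric that dropped by more than `tolerance`."""
    failures: List[str] = []
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            failures.append(f"{metric}: missing from current run")
        elif base_score - new_score > tolerance:
            failures.append(f"{metric}: {base_score:.3f} -> {new_score:.3f}")
    return failures


def gate_exit_code(failures: List[str]) -> int:
    """Non-zero exit code makes a CI job fail when any metric regressed."""
    return 1 if failures else 0
```

A script built this way would load the baseline and latest aggregate scores, call `find_regressions`, print the failures, and pass `gate_exit_code(...)` to `sys.exit`.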

For the fallback path (no AWS, runs entirely on your laptop):

LLM_PROVIDER=ollama uvicorn app.main:app --reload

In this mode, Ollama (serving Gemma 4 and nomic-embed-text) plus Chroma covers everything except the cloud-specific pieces.
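
The env-var flip can be pictured as a small resolver that the LLM and embedding factories consult. This is a hedged sketch rather than app/llm.py itself: the `ProviderConfig` type, the `OLLAMA_*` variable names, the Bedrock default, and the Ollama model tag placeholder are all assumptions. Only `LLM_PROVIDER`, `BEDROCK_LLM_MODEL`, and `BEDROCK_EMBED_MODEL` appear in this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderConfig:
    provider: str
    llm_model: str
    embed_model: str


def resolve_provider(env: Optional[Mapping[str, str]] = None) -> ProviderConfig:
    """Pick model identifiers from environment variables."""
    env = os.environ if env is None else env
    # Assumption: Bedrock is the default when LLM_PROVIDER is unset.
    provider = env.get("LLM_PROVIDER", "bedrock").lower()
    if provider == "bedrock":
        return ProviderConfig(
            provider="bedrock",
            llm_model=env.get(
                "BEDROCK_LLM_MODEL",
                "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
            ),
            embed_model=env.get("BEDROCK_EMBED_MODEL", "amazon.titan-embed-text-v2:0"),
        )
    if provider == "ollama":
        return ProviderConfig(
            provider="ollama",
            # Placeholder tag: set the real Ollama tag for your Gemma build.
            llm_model=env.get("OLLAMA_LLM_MODEL", "gemma-placeholder"),
            embed_model=env.get("OLLAMA_EMBED_MODEL", "nomic-embed-text"),
        )
    raise ValueError(f"unsupported LLM_PROVIDER: {provider!r}")
```

Keeping the switch in one function means the rest of the code never branches on the provider name.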

Why each step is in the order it is

Step ordering is deliberate. Each one is the right next problem given what came before:

   FOUNDATIONS (0-8)
   Setup → bare LLM → vector store → ingest → naive RAG
        → FastAPI → tools → agent graph → endpoint
            │
   OPERATIONAL DEPTH (10-12)
   ↓ Observability before refactoring (you'll need traces to verify
   │ the next changes worked)
   ↓ Hybrid + reranker (the production retrieval pattern)
   ↓ RAGAS eval (now changes have a measurable answer)
            │
   STATE & SAFETY (13-14)
   ↓ Postgres checkpoints + multi-turn (real chatbot UX)
   ↓ Selective HITL (safe to ship destructive tools)
            │
   PRODUCTION POSTURE (15-17)
   ↓ Bedrock migration (production-grade LLM/embeddings)
   ↓ Conversation summarization (bounds token cost)
   ↓ Structured outputs (typed contracts on boundaries)
            │
   ARCHITECTURE & UX (18-20)
   ↓ Multi-agent supervisor (capability scales without prompt bloat)
   ↓ Two-tier caching (cost + latency win)
   ↓ Streaming with structured events (production UX)

This isn't the order most tutorials teach, but it's the order a senior engineer would actually build it in. Observability before optimization. Eval before refactor. Foundations stable before adding capability. Production posture before deployment.

The 20 steps

#	Step	What it adds	Key concept
0	Setup	Ollama + project skeleton + config	Tooling foundation
1	LLM baseline	`ChatOllama`, prompt templates	Provider abstraction
2	Vector store	Chroma + embeddings + metadata filter	Semantic search
3	Ingestion	Recursive chunking, metadata tagging	Chunking strategy
4	Naive RAG	LCEL retrieve → prompt → generate	The RAG primitive
5	FastAPI surface	Endpoints, Pydantic validation, lifespan	Service shape
6	Tools	`@tool`, schema generation, RAG-as-tool	Function-calling
7	LangGraph agent	`StateGraph`, conditional edges, ReAct loop	Stateful graphs
8	Agent endpoint	End-to-end three-tier API	Composition
10	Observability	LangSmith traces with tags + metadata	Distributed tracing for LLMs
11	Hybrid + rerank	BM25 + dense ensemble + Cohere/BGE	Production retrieval
12	Eval harness	RAGAS metrics, golden dataset, regression script	Measurement
13	Checkpoints	Postgres-backed multi-turn state	Stateful agents
14	HITL	Selective `interrupt()` on risky tools, approve/reject/edit	Production safety
15	Bedrock	Provider switching (Bedrock ↔ Ollama via env var)	Multi-provider posture
16	Summarization	Token-budget-triggered compression with `RemoveMessage`	Custom reducers
17	Structured outputs	Manual tool-binding, forced tool choice, Pydantic boundary	Typed contracts
18	Supervisor	Three specialist subgraphs + LLM-routed dispatch	Multi-agent architecture
19	Caching	Two-tier Redis (exact + semantic), per-specialist policies	Cost/latency wins
20	Streaming	SSE with structured event types	Production UX

(Step 9 is the original "where to next" roadmap; steps 10-20 implement extensions from it.)
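
For Step 11, it helps to see how two ranked lists become one. The sketch below uses weighted reciprocal rank fusion, one common way to merge BM25 and dense results; treat the formula, the constant `c = 60`, and the equal weights as assumptions rather than a statement of what the repo's retriever does internally. The reranking stage is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rankings: Sequence[List[str]],
    weights: Sequence[float],
    c: int = 60,
) -> List[str]:
    """Fuse ranked lists of document ids into a single ranking.

    Each document earns weight / (c + rank) from every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Sort by descending score; break ties by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, fusing a BM25 list `["a", "b", "c"]` with a dense list `["b", "d", "a"]` at equal weights ranks `b` first, because it sits near the top of both lists.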

What you'll be able to defend in an interview

After working through this, you can articulate:

"How do you debug LLM apps?" → distributed tracing via LangSmith; tags + metadata for filtering; trace-based eval (Step 10)
"How do you measure RAG quality?" → RAGAS harness with faithfulness, relevancy, context-precision, context-recall; golden dataset curation; regression-blocking CI gate (Step 12)
"What's your retrieval setup?" → hybrid BM25+dense via EnsembleRetriever, Cohere reranker on top, measurable improvement vs. dense-only (Step 11)
"How do you handle long conversations?" → Postgres checkpoints by thread_id; auto-summarization past 3K-token threshold using RemoveMessage reducer (Steps 13, 16)
"How do you make destructive actions safe?" → selective interrupt() on RISKY_TOOLS; approve/reject/edit semantics; durable pause via checkpoints (Step 14)
"How do you scale capability without prompt bloat?" → multi-agent supervisor with subgraph specialists; typed RouteDecision contract; per-specialist tool lists and prompts (Step 18)
"How do you cut LLM costs?" → two-tier cache (exact + semantic) with per-specialist TTL policies; threshold tuning against eval set; cache-invalidation strategy (Step 19)
"How do you ship streaming?" → SSE with structured event taxonomy (token, tool_start, cache_hit, etc.); tag-based filtering for selective streaming; X-Accel-Buffering: no for proxy compatibility; cancellation handling (Step 20)
"Why these specific tools/models?" → multi-provider via env-var switch; Bedrock for prod (managed, IAM, regional); Ollama for offline dev; tradeoffs measurable in same eval harness (Step 15)

Each answer is backed by code in this repo and traces in LangSmith. That's the difference between "I read about agents" and "I've built and operated agents."
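
As one example of what sits behind these answers, the selective HITL decision from Step 14 reduces to two small pieces of logic: which tool calls need a human, and what each review action does to the call. The sketch below is a simplification under assumptions. The `ToolCall` type, the helper names, and the `calculator` tool name are hypothetical, and the actual pause and resume (LangGraph's `interrupt()` plus the checkpointer) is not shown.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Only tools with side effects pause for review; the name comes from the diagram.
RISKY_TOOLS = frozenset({"submit_refund_request"})


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


def needs_approval(call: ToolCall) -> bool:
    """Read-only tools run straight through; risky ones wait for a human."""
    return call.name in RISKY_TOOLS


def apply_review(
    call: ToolCall,
    action: str,
    edited_args: Optional[Dict[str, Any]] = None,
) -> Optional[ToolCall]:
    """Resolve a reviewer decision into the call to execute, or None to skip."""
    if action == "approve":
        return call
    if action == "edit":
        if edited_args is None:
            raise ValueError("edit requires edited_args")
        # Build a new call so the original request stays intact for auditing.
        return ToolCall(call.name, {**call.args, **edited_args})
    if action == "reject":
        return None
    raise ValueError(f"unknown review action: {action!r}")
```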

Repository layout

genai-rag-app/
├── app/
│   ├── main.py                  ← FastAPI surface
│   ├── agents/
│   │   ├── policy.py            ← RAG specialist subgraph
│   │   ├── ops.py               ← tool-using specialist subgraph
│   │   └── general.py           ← LLM-only fallback
│   ├── agent_graph.py           ← parent graph: supervisor + specialists
│   ├── agent_graph_helpers.py   ← shared HITL/tools logic
│   ├── supervisor.py            ← intent classification node
│   ├── cache.py                 ← two-tier Redis cache
│   ├── checkpointer.py          ← Postgres checkpointer + connection pool
│   ├── streaming.py             ← SSE event-translation layer
│   ├── eval.py                  ← RAGAS harness
│   ├── llm.py                   ← provider-switching factories
│   ├── retrieval.py             ← hybrid + rerank pipeline
│   ├── rag_chain.py             ← structured RAG via tool binding
│   ├── tools.py                 ← @tool definitions
│   ├── schemas.py               ← Pydantic types (RagAnswer, RouteDecision)
│   ├── vector_store.py          ← Chroma wrapper
│   ├── config.py                ← single source of truth for tunables
│   └── routes.py                ← FastAPI routes
├── data/
│   ├── docs/                    ← knowledge base source files
│   ├── chroma/                  ← vector index (generated)
│   └── eval/
│       ├── golden.jsonl         ← evaluation dataset
│       └── results/             ← per-run CSVs + aggregate trend
├── scripts/
│   ├── ingest.py                ← chunk + embed + index
│   ├── run_eval.py              ← run RAGAS, write CSV
│   ├── eval_check.py            ← regression detection (CI-style)
│   └── NN_*_smoke.py            ← per-step smoke tests
├── static/
│   └── stream_demo.html         ← browser SSE demo
└── steps/
    └── 00-20.md                 ← the build guide

What's intentionally NOT here

This system is feature-complete from a portfolio standpoint, but production readiness needs more. The honest list:

Authentication / multi-tenancy. thread_id should be derived from authenticated identity, not trusted from the client. Skipped for teaching clarity; called out in Step 13's interview talking points.
Per-tenant data isolation. Production multi-tenant RAG needs per-tenant collections (or at minimum, per-tenant metadata filtering on retrieval). Not built here.
Deployment. No Docker Compose file, no Lambda/SAM, no Terraform. The architecture is deploy-ready (stateless API, Postgres + Redis externalized, AWS for inference); turning it into a deployment is its own project.
Cost-aware routing. The supervisor uses Sonnet for every classification. Production would route trivially-classifiable queries with a cheaper model (Haiku, or even an embedding-similarity classifier).
Conversation summarization at the parent-graph level. Step 16 wired summarization into the pre-supervisor agent graph; the Step 18 refactor to specialists left it unintegrated. Re-integration is a one-page extension.
Production-grade vector search. Chroma is fine to ~100K chunks. Past that, Qdrant, Pinecone, or pgvector with HNSW are the next step. The retriever interface is the swap boundary; one file changes.
Audit trail beyond LangSmith. HITL approve/reject events should also write to a structured audit log for compliance contexts. Not built here.
Fine-tuning. Out of scope; the agent uses prompting + tool design rather than custom-trained models. For most use cases, that's the right call.

These are the conversations the system gets you to, not gaps that invalidate it.
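
To illustrate the first item, here is one possible way to derive thread_id on the server from an authenticated identity. It is a sketch of an approach this repo does not implement: the function name, the `conversation_id` parameter, and the use of a server-side HMAC secret are all assumptions.

```python
import hashlib
import hmac
import json


def derive_thread_id(user_id: str, conversation_id: str, secret: bytes) -> str:
    """Build a thread id that a client cannot forge for another user.

    `user_id` must come from the verified auth token, never from the request
    body. JSON-encoding the pair avoids ambiguity between inputs such as
    ("a:b", "c") and ("a", "b:c").
    """
    message = json.dumps([user_id, conversation_id]).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()
```

With this shape, the client may still pick a conversation label, but the checkpoint key always depends on who is authenticated, so guessing another user's label does not open their thread.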

For learners

Reading order: README → Step 0 → work through sequentially. Don't skip steps. Each one assumes the previous one is working; running into a problem from Step N+2 because you skipped N is a frustrating way to spend an evening.

Per-step structure (every step follows the same template):

Goal — what you're building, in one sentence
What you're building — diagram + file list, before any code
Mental model — why this is the next right thing
Code sections — typed in, not just pasted; the typing helps retention
Verification — concrete test that proves it works
Senior framing — interview talking points specific to that step
Troubleshooting — every gotcha I hit, with the fix

If a step takes you longer than the time estimate, you're probably learning more than the estimate accounted for. That's fine. The point is the depth, not the speed.

For interviewers / engineers reviewing this

The implementation choices that distinguish this from "another RAG tutorial":

Manual tool binding for structured outputs (Step 17) instead of with_structured_output() — teaches the underlying mechanism so you can drop down when the high-level API breaks.
Subgraphs for specialists (Step 18) instead of node functions — independently testable, deployable, and replaceable; mirrors the monolith-to-services architectural shift.
Selective HITL (Step 14) instead of blanket interrupt_before — the production pattern; pause only on tools with side effects.
Custom RemoveMessage-based summarization (Step 16) instead of "just truncate the list" — shows the LangGraph reducer model under the hood.
Two-tier cache with exact-then-semantic (Step 19) instead of one-tier or LangChain's built-in cache — demonstrates the production hierarchy and threshold-tuning discussion.
Tag-based event filtering for streaming (Step 20) instead of streaming everything — prevents leaking supervisor reasoning to users.
Eval harness with regression-detection script (Step 12) instead of one-shot evals — the CI-gate pattern.

Each is a small choice that signals familiarity with what production looks like, not just what tutorials show.
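
The tag-based filtering choice is easy to show in miniature. The sketch below formats events in the Server-Sent Events wire format and drops any event carrying a hidden tag. The event dictionary shape (`type`, `data`, `tags`) and the `supervisor` tag name are assumptions for illustration; the real translation layer works on LangGraph's event stream.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Set


def format_sse(event_type: str, data: Dict[str, Any]) -> str:
    """Serialize one event as an SSE frame: event line, data line, blank line."""
    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"


def stream_user_events(
    events: Iterable[Dict[str, Any]],
    hidden_tags: Set[str],
) -> Iterator[str]:
    """Yield SSE frames, skipping events tagged as internal."""
    for event in events:
        if hidden_tags & set(event.get("tags", [])):
            continue  # e.g. supervisor reasoning that users should not see
        yield format_sse(event["type"], event["data"])
```

A FastAPI endpoint could wrap a generator like this in a `StreamingResponse` with media type `text/event-stream`.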

Status

Built and tested against:

Python 3.13
LangChain, LangGraph, langchain-aws
Postgres 16, Redis 7
Claude Sonnet 4.5 via Bedrock (us.anthropic.claude-sonnet-4-5-20250929-v1:0)
Amazon Titan v2 embeddings

If you find a step that's drifted out of date, the troubleshooting section in each step is the first place to check.
