sunilsm7/gen-ai-rag-app

Production-Shaped GenAI: A Multi-Agent System Built in 20 Steps

A complete, runnable, multi-agent RAG application demonstrating the production patterns senior GenAI engineers ship: hybrid retrieval, agentic tool use, human-in-the-loop, multi-turn state with persistence, structured outputs, multi-agent orchestration, two-tier caching, structured event streaming, and a measurement harness on top of all of it.

Built incrementally across 20 focused steps: each one introduces one concept, produces something runnable, and ties back to senior-level interview talking points.


What this system actually does

                       POST /agent {thread_id, question}
                                    │
                                    ▼
                            ┌──────────────┐
                            │  supervisor  │  classifies intent → typed RouteDecision
                            └──────┬───────┘
                                   │
                ┌──────────────────┼──────────────────┐
                ▼                  ▼                  ▼
       ┌────────────────┐  ┌──────────────┐  ┌───────────────┐
       │  policy_agent  │  │  ops_agent   │  │ general_agent │
       │  (RAG)         │  │  (calc +     │  │ (LLM-only     │
       │                │  │   refund)    │  │  fallback)    │
       │  ┌──────────┐  │  │              │  │               │
       │  │ Redis    │  │  │  ToolNode    │  │               │
       │  │ cache    │  │  │  + HITL on   │  │               │
       │  │ (2-tier) │  │  │  risky tools │  │               │
       │  └────┬─────┘  │  └──────┬───────┘  └───────┬───────┘
       │       │ miss   │         │                   │
       │       ▼        │         ▼                   ▼
       │  ┌──────────┐  │   tool execution      simple LLM
       │  │ Hybrid   │  │   (interrupts on        response
       │  │ retrieval│  │    submit_refund_       (general
       │  │ BM25+    │  │    request)            knowledge,
       │  │ dense    │  │                         smalltalk)
       │  │ + rerank │  │
       │  └────┬─────┘  │
       │       ▼        │
       │  Structured    │
       │  RagAnswer     │
       │  (Pydantic)    │
       └───────┬────────┘
               │
               └──────────────────┬──────────────────┘
                                  ▼
                         ┌───────────────────┐
                         │   /agent/stream   │  ← SSE structured events
                         │   /agent          │  ← JSON response
                         └─────────┬─────────┘
                                   │
                                   ▼
            ┌────────────────────────────────────────────┐
            │  Cross-cutting:                            │
            │  • Postgres (LangGraph checkpoints)        │
            │  • Redis (cache)                           │
            │  • LangSmith (tracing)                     │
            │  • RAGAS (eval harness, regression detect) │
            │  • AWS Bedrock (Claude Sonnet 4.5)         │
            │  • Ollama fallback (Gemma 4) via env flip  │
            └────────────────────────────────────────────┘

Three specialist agents, one supervisor, one knowledge base, one cache, one eval harness, one streaming endpoint. Built with the patterns you'd actually defend in a senior interview.
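
To make the supervisor's typed routing contract concrete, here is a minimal sketch. It is not the repo's code: the repo describes `RouteDecision` as a Pydantic type, while this version uses a stdlib dataclass so it runs without dependencies. The `classify` keyword stub stands in for the LLM classifier, and the `target` and `reason` field names and the `HANDLERS` table are assumptions made for illustration. Only the three specialist names come from the diagram above.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Specialist names come from the diagram; everything else is hypothetical.
SPECIALISTS = ("policy_agent", "ops_agent", "general_agent")


@dataclass(frozen=True)
class RouteDecision:
    """Typed routing contract (stdlib stand-in for a Pydantic model)."""
    target: str
    reason: str = ""

    def __post_init__(self) -> None:
        # Reject anything that is not a known specialist at the boundary.
        if self.target not in SPECIALISTS:
            raise ValueError(f"unknown specialist: {self.target!r}")


def classify(question: str) -> RouteDecision:
    """Keyword stub standing in for the LLM-based intent classifier."""
    q = question.lower()
    if any(w in q for w in ("policy", "warranty", "refund window")):
        return RouteDecision("policy_agent", "policy lookup")
    if any(w in q for w in ("calculate", "submit a refund")):
        return RouteDecision("ops_agent", "needs a tool")
    return RouteDecision("general_agent", "fallback")


# One placeholder handler per specialist; real handlers would be subgraphs.
HANDLERS: Dict[str, Callable[[str], str]] = {
    name: (lambda q, name=name: f"[{name}] {q}") for name in SPECIALISTS
}


def dispatch(question: str) -> str:
    """Route a question to the specialist named in the typed decision."""
    decision = classify(question)
    return HANDLERS[decision.target](question)
```

Because the decision is validated when it is constructed, a malformed classifier output fails at the routing boundary instead of inside a specialist.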


Capabilities at a glance

| Layer | Component | Step |
|---|---|---|
| LLM | Claude Sonnet 4.5 via Bedrock (Gemma 4 fallback via Ollama) | 0, 1, 15 |
| Embeddings | Amazon Titan v2 (nomic-embed-text fallback) | 2, 15 |
| Vector DB | Chroma with metadata filtering | 2, 3 |
| Retrieval | Hybrid BM25 + dense + Cohere/BGE reranker | 11 |
| Generation | Structured RagAnswer via tool-binding + forced tool choice | 4, 17 |
| Agent runtime | LangGraph stateful graphs, three subgraph specialists | 7, 18 |
| Tool use | Calculator, KB-search, refund submission with selective HITL | 6, 14 |
| Routing | LLM-based intent classification, typed RouteDecision | 18 |
| State | Postgres checkpoints, multi-turn conversation memory | 13 |
| Compression | Auto-summarization of older turns past token threshold | 16 |
| Caching | Two-tier Redis cache (exact + semantic, per-specialist policies) | 19 |
| Streaming | Server-Sent Events with structured event types | 20 |
| Observability | LangSmith tracing with tags, metadata, per-step spans | 10 |
| Evaluation | RAGAS harness, golden dataset, regression detection script | 12, 18 |
| API | FastAPI with Pydantic validation, lifespan pre-warming | 5, 8 |
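
The hybrid retrieval row combines a keyword ranking (BM25) with a dense-vector ranking. One common way to merge two ranked lists is weighted reciprocal rank fusion, sketched below in plain Python. This is an illustration of the fusion idea only: whether the repo's retriever is configured this way is an assumption, the document IDs are invented, `k=60` is a conventional default rather than a value from this project, and the reranking stage is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(
    rankings: Sequence[List[str]],
    weights: Sequence[float],
    k: int = 60,
) -> List[str]:
    """Merge ranked lists: each list adds weight / (k + rank) to a doc's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Invented document ids, for illustration only.
bm25_ranking = ["doc_refund", "doc_shipping", "doc_warranty"]
dense_ranking = ["doc_warranty", "doc_refund", "doc_returns"]

fused = rrf_fuse([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
# ['doc_refund', 'doc_warranty', 'doc_shipping', 'doc_returns']
```

Documents that both retrievers rank well rise to the top, which is the behavior a reranker then refines.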

Quick start

# 1. Prerequisites
brew install postgresql@16 redis    # or run via Docker
docker run -d --name pg -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=genai -p 5432:5432 postgres:16
docker run -d --name redis -p 6379:6379 redis:7-alpine

# 2. AWS Bedrock setup (one-time)
aws configure                       # standard credentials
# In Bedrock console → Model access → request Claude Sonnet 4.5 + Titan v2

# 3. Environment
cp .env.example .env
# Set: AWS_REGION, BEDROCK_LLM_MODEL, BEDROCK_EMBED_MODEL,
#      POSTGRES_DSN, REDIS_URL, LANGCHAIN_API_KEY, COHERE_API_KEY

# 4. Python
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 5. Index the knowledge base
python scripts/ingest.py

# 6. Run
uvicorn app.main:app --reload --port 8000

Then in another terminal:

# Single-turn agent invocation
curl -X POST http://localhost:8000/agent \
  -H "Content-Type: application/json" \
  -d '{"thread_id": "demo", "question": "What is the refund window?"}'

# Streaming with structured events
curl -N -X POST http://localhost:8000/agent/stream \
  -H "Content-Type: application/json" \
  -d '{"thread_id": "demo2", "question": "Compare SmartHub Pro and Lite"}'

# Browser demo for streaming UX
open http://localhost:8000/static/stream_demo.html

# Run the eval harness
python scripts/run_eval.py --tag baseline
python scripts/eval_check.py        # CI-style regression gate
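
The regression gate run by the last command above can be pictured as a comparison between a baseline run and the current run. The sketch below is an assumption about the shape of such a check, not the contents of `scripts/eval_check.py`: the `tolerance` value and the scores are made up, and only the metric names (faithfulness, context recall) are taken from this README.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,  # hypothetical allowed drop before failing
) -> List[str]:
    """Return one message per metric that dropped more than `tolerance`."""
    failures: List[str] = []
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            failures.append(f"{metric}: missing from current run")
        elif new_score < base_score - tolerance:
            failures.append(f"{metric}: {base_score:.2f} -> {new_score:.2f}")
    return failures


# Made-up scores, for illustration only.
baseline = {"faithfulness": 0.90, "context_recall": 0.80}
current = {"faithfulness": 0.91, "context_recall": 0.70}

problems = find_regressions(baseline, current)
exit_code = 1 if problems else 0  # a CI job would exit with this code
print(problems)
# ['context_recall: 0.80 -> 0.70']
```

A non-zero exit code is what lets a CI pipeline block a change that degrades retrieval or answer quality.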

For the fallback path (no AWS, runs entirely on your laptop):

LLM_PROVIDER=ollama uvicorn app.main:app --reload

Ollama with Gemma 4 + nomic-embed-text + Chroma still works for everything except the cloud-specific bits.
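
The environment-variable flip shown above can be sketched as a small factory. This is an illustration under stated assumptions, not the repo's `app/llm.py`: `LLM_PROVIDER` and `BEDROCK_LLM_MODEL` appear in this README and the Bedrock model ID is the one listed under Status, but defaulting to Bedrock, the `OLLAMA_LLM_MODEL` variable name, and the placeholder model tag are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ProviderConfig:
    provider: str
    llm_model: str


def resolve_provider(env: Mapping[str, str] = os.environ) -> ProviderConfig:
    """Pick the LLM backend from the environment (Bedrock default is an assumption)."""
    provider = env.get("LLM_PROVIDER", "bedrock").lower()
    if provider == "bedrock":
        model = env.get(
            "BEDROCK_LLM_MODEL",
            "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
        )
        return ProviderConfig("bedrock", model)
    if provider == "ollama":
        # OLLAMA_LLM_MODEL is a hypothetical variable name; the default is a
        # placeholder to be replaced with a locally pulled model tag.
        model = env.get("OLLAMA_LLM_MODEL", "<your-ollama-model-tag>")
        return ProviderConfig("ollama", model)
    raise ValueError(f"unsupported LLM_PROVIDER: {provider!r}")
```

A real factory would go on to construct the matching chat-model client; keeping the decision in one function means the rest of the application never branches on the provider.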


Why each step is in the order it is

Step ordering is deliberate. Each one is the right next problem given what came before:

   FOUNDATIONS (0-8)
   Setup → bare LLM → vector store → ingest → naive RAG
        → FastAPI → tools → agent graph → endpoint
            │
   OPERATIONAL DEPTH (10-12)
   ↓ Observability before refactoring (you'll need traces to verify
   │ the next changes worked)
   ↓ Hybrid + reranker (the production retrieval pattern)
   ↓ RAGAS eval (now changes have a measurable answer)
            │
   STATE & SAFETY (13-14)
   ↓ Postgres checkpoints + multi-turn (real chatbot UX)
   ↓ Selective HITL (safe to ship destructive tools)
            │
   PRODUCTION POSTURE (15-17)
   ↓ Bedrock migration (production-grade LLM/embeddings)
   ↓ Conversation summarization (bounds token cost)
   ↓ Structured outputs (typed contracts on boundaries)
            │
   ARCHITECTURE & UX (18-20)
   ↓ Multi-agent supervisor (capability scales without prompt bloat)
   ↓ Two-tier caching (cost + latency win)
   ↓ Streaming with structured events (production UX)

This isn't the order most tutorials teach, but it's the order a senior engineer would actually build it. Observability before optimization. Eval before refactor. Foundations stable before adding capability. Production posture before deployment.


The 20 steps

| # | Step | What it adds | Key concept |
|---|---|---|---|
| 0 | Setup | Ollama + project skeleton + config | Tooling foundation |
| 1 | LLM baseline | ChatOllama, prompt templates | Provider abstraction |
| 2 | Vector store | Chroma + embeddings + metadata filter | Semantic search |
| 3 | Ingestion | Recursive chunking, metadata tagging | Chunking strategy |
| 4 | Naive RAG | LCEL retrieve → prompt → generate | The RAG primitive |
| 5 | FastAPI surface | Endpoints, Pydantic validation, lifespan | Service shape |
| 6 | Tools | @tool, schema generation, RAG-as-tool | Function-calling |
| 7 | LangGraph agent | StateGraph, conditional edges, ReAct loop | Stateful graphs |
| 8 | Agent endpoint | End-to-end three-tier API | Composition |
| 10 | Observability | LangSmith traces with tags + metadata | Distributed tracing for LLMs |
| 11 | Hybrid + rerank | BM25 + dense ensemble + Cohere/BGE | Production retrieval |
| 12 | Eval harness | RAGAS metrics, golden dataset, regression script | Measurement |
| 13 | Checkpoints | Postgres-backed multi-turn state | Stateful agents |
| 14 | HITL | Selective interrupt() on risky tools, approve/reject/edit | Production safety |
| 15 | Bedrock | Provider switching (Bedrock ↔ Ollama via env var) | Multi-provider posture |
| 16 | Summarization | Token-budget-triggered compression with RemoveMessage | Custom reducers |
| 17 | Structured outputs | Manual tool-binding, forced tool choice, Pydantic boundary | Typed contracts |
| 18 | Supervisor | Three specialist subgraphs + LLM-routed dispatch | Multi-agent architecture |
| 19 | Caching | Two-tier Redis (exact + semantic), per-specialist policies | Cost/latency wins |
| 20 | Streaming | SSE with structured event types | Production UX |

(Step 9 is the original "where to next" roadmap; steps 10-20 implement extensions from it.)
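
Step 16's token-budget trigger can be sketched as a pure function that decides which older turns to compress. This is a simplified illustration, not the repo's implementation: messages are plain strings, the four-characters-per-token estimate is a crude heuristic, and `KEEP_RECENT` is a hypothetical setting. The 3,000-token threshold is the figure this README cites; in the actual graph the older turns would be dropped from state via `RemoveMessage` after being summarized.

```python
from typing import List, Tuple

TOKEN_THRESHOLD = 3000  # the README cites a 3K-token threshold
KEEP_RECENT = 4         # hypothetical: number of latest messages kept verbatim


def estimate_tokens(text: str) -> int:
    """Crude heuristic (about 4 characters per token); a real tokenizer differs."""
    return max(1, len(text) // 4)


def split_for_summary(messages: List[str]) -> Tuple[List[str], List[str]]:
    """Return (to_summarize, to_keep); nothing is summarized under the threshold."""
    total = sum(estimate_tokens(m) for m in messages)
    if total <= TOKEN_THRESHOLD or len(messages) <= KEEP_RECENT:
        return [], list(messages)
    return list(messages[:-KEEP_RECENT]), list(messages[-KEEP_RECENT:])
```

The older slice would be sent to the LLM for a summary, and the summary plus the recent slice become the new conversation state.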


What you'll be able to defend in an interview

After working through this, you can articulate:

  • "How do you debug LLM apps?" → distributed tracing via LangSmith; tags + metadata for filtering; trace-based eval (Step 10)
  • "How do you measure RAG quality?" → RAGAS harness with faithfulness, relevancy, context-precision, context-recall; golden dataset curation; regression-blocking CI gate (Step 12)
  • "What's your retrieval setup?" → hybrid BM25+dense via EnsembleRetriever, Cohere reranker on top, measurable improvement vs. dense-only (Step 11)
  • "How do you handle long conversations?" → Postgres checkpoints by thread_id; auto-summarization past 3K-token threshold using RemoveMessage reducer (Steps 13, 16)
  • "How do you make destructive actions safe?" → selective interrupt() on RISKY_TOOLS; approve/reject/edit semantics; durable pause via checkpoints (Step 14)
  • "How do you scale capability without prompt bloat?" → multi-agent supervisor with subgraph specialists; typed RouteDecision contract; per-specialist tool lists and prompts (Step 18)
  • "How do you cut LLM costs?" → two-tier cache (exact + semantic) with per-specialist TTL policies; threshold tuning against eval set; cache-invalidation strategy (Step 19)
  • "How do you ship streaming?" → SSE with structured event taxonomy (token, tool_start, cache_hit, etc.); tag-based filtering for selective streaming; X-Accel-Buffering: no for proxy compatibility; cancellation handling (Step 20)
  • "Why these specific tools/models?" → multi-provider via env-var switch; Bedrock for prod (managed, IAM, regional); Ollama for offline dev; tradeoffs measurable in same eval harness (Step 15)
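
The selective human-in-the-loop answer above reduces to two decisions: does this tool call need approval, and what happens once a reviewer responds. The sketch below shows that logic in plain Python under stated assumptions. It does not call LangGraph (the repo pauses with `interrupt()`), the shape of the `decision` payload is hypothetical, and only the `RISKY_TOOLS` name and the `submit_refund_request` tool come from this README.

```python
from typing import Any, Dict, Optional

# Tool name taken from the architecture diagram.
RISKY_TOOLS = {"submit_refund_request"}

ToolCall = Dict[str, Any]


def needs_approval(tool_call: ToolCall) -> bool:
    """Only tools with side effects pause for a human."""
    return tool_call["name"] in RISKY_TOOLS


def apply_decision(tool_call: ToolCall, decision: Dict[str, Any]) -> Optional[ToolCall]:
    """Return the call to execute, or None if the reviewer rejected it.

    `decision` is a hypothetical payload: {"action": ..., "args": {...}}.
    """
    action = decision["action"]
    if action == "approve":
        return tool_call
    if action == "edit":
        # Reviewer-supplied args override the model's proposed args.
        merged = {**tool_call["args"], **decision.get("args", {})}
        return {**tool_call, "args": merged}
    if action == "reject":
        return None
    raise ValueError(f"unknown action: {action!r}")
```

Read-only tools such as the calculator skip the gate entirely, which is what distinguishes the selective pattern from pausing before every tool call.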

Each answer is backed by code in this repo and traces in LangSmith. That's the difference between "I read about agents" and "I've built and operated agents."
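
As one example of what such code looks like, the exact-then-semantic lookup order of the two-tier cache can be sketched with in-memory structures. This is a toy model, not the repo's `app/cache.py`: dictionaries stand in for Redis, a bag-of-words vector stands in for a real embedding, the `0.8` threshold is arbitrary, and TTLs and per-specialist policies are left out.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TwoTierCache:
    """Tier 1: exact match on the normalized question. Tier 2: similarity match."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.exact: Dict[str, str] = {}
        self.semantic: List[Tuple[Counter, str]] = []
        self.threshold = threshold

    def put(self, question: str, answer: str) -> None:
        self.exact[question.strip().lower()] = answer
        self.semantic.append((embed(question), answer))

    def get(self, question: str) -> Optional[Tuple[str, str]]:
        """Return (tier, answer) on a hit, or None on a miss."""
        key = question.strip().lower()
        if key in self.exact:  # cheap check first
            return ("exact", self.exact[key])
        query_vec = embed(question)
        for vec, answer in self.semantic:  # costlier similarity scan second
            if cosine(query_vec, vec) >= self.threshold:
                return ("semantic", answer)
        return None
```

The threshold is the tuning knob: set too low, paraphrases of different questions collide; set too high, the semantic tier rarely hits. That is why the README pairs the cache with tuning against the eval set.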


Repository layout

genai-rag-app/
├── app/
│   ├── main.py                  ← FastAPI surface
│   ├── agents/
│   │   ├── policy.py            ← RAG specialist subgraph
│   │   ├── ops.py               ← tool-using specialist subgraph
│   │   └── general.py           ← LLM-only fallback
│   ├── agent_graph.py           ← parent graph: supervisor + specialists
│   ├── agent_graph_helpers.py   ← shared HITL/tools logic
│   ├── supervisor.py            ← intent classification node
│   ├── cache.py                 ← two-tier Redis cache
│   ├── checkpointer.py          ← Postgres checkpointer + connection pool
│   ├── streaming.py             ← SSE event-translation layer
│   ├── eval.py                  ← RAGAS harness
│   ├── llm.py                   ← provider-switching factories
│   ├── retrieval.py             ← hybrid + rerank pipeline
│   ├── rag_chain.py             ← structured RAG via tool binding
│   ├── tools.py                 ← @tool definitions
│   ├── schemas.py               ← Pydantic types (RagAnswer, RouteDecision)
│   ├── vector_store.py          ← Chroma wrapper
│   ├── config.py                ← single source of truth for tunables
│   └── routes.py                ← FastAPI routes
├── data/
│   ├── docs/                    ← knowledge base source files
│   ├── chroma/                  ← vector index (generated)
│   └── eval/
│       ├── golden.jsonl         ← evaluation dataset
│       └── results/             ← per-run CSVs + aggregate trend
├── scripts/
│   ├── ingest.py                ← chunk + embed + index
│   ├── run_eval.py              ← run RAGAS, write CSV
│   ├── eval_check.py            ← regression detection (CI-style)
│   └── NN_*_smoke.py            ← per-step smoke tests
├── static/
│   └── stream_demo.html         ← browser SSE demo
└── steps/
    └── 00-20.md                 ← the build guide

What's intentionally NOT here

This system is feature-complete from a portfolio standpoint, but production readiness needs more. The honest list:

  • Authentication / multi-tenancy. thread_id should be derived from authenticated identity, not trusted from the client. Skipped for teaching clarity; called out in Step 13's interview talking points.
  • Per-tenant data isolation. Production multi-tenant RAG needs per-tenant collections (or at minimum, per-tenant metadata filtering on retrieval). Not built here.
  • Deployment. No Docker Compose file, no Lambda/SAM, no Terraform. The architecture is deploy-ready (stateless API, Postgres + Redis externalized, AWS for inference); turning it into a deployment is its own project.
  • Cost-aware routing. The supervisor uses Sonnet for every classification. Production would route trivially-classifiable queries with a cheaper model (Haiku, or even an embedding-similarity classifier).
  • Conversation summarization at the parent-graph level. Step 16 wired summarization into the pre-supervisor agent graph; the Step 18 refactor to specialists left it unintegrated. Re-integration is a one-page extension.
  • Production-grade vector search. Chroma is fine to ~100K chunks. Past that, Qdrant, Pinecone, or pgvector with HNSW are the next step. The retriever interface is the swap boundary; one file changes.
  • Audit trail beyond LangSmith. HITL approve/reject events should also write to a structured audit log for compliance contexts. Not built here.
  • Fine-tuning. Out of scope; the agent uses prompting + tool design rather than custom-trained models. For most use cases, that's the right call.

These are the conversations the system gets you to, not gaps that invalidate it.


For learners

Reading order: README → Step 0 → work through sequentially. Don't skip steps. Each one assumes the previous one is working; running into a problem from Step N+2 because you skipped N is a frustrating way to spend an evening.

Per-step structure (every step follows the same template):

  1. Goal: what you're building, in one sentence
  2. What you're building: diagram + file list, before any code
  3. Mental model: why this is the next right thing
  4. Code sections: typed in, not just pasted; the typing helps retention
  5. Verification: concrete test that proves it works
  6. Senior framing: interview talking points specific to that step
  7. Troubleshooting: every gotcha I hit, with the fix

If a step takes you longer than the time estimate, you're probably learning more than the estimate accounted for. That's fine. The point is the depth, not the speed.


For interviewers / engineers reviewing this

The implementation choices that distinguish this from "another RAG tutorial":

  • Manual tool binding for structured outputs (Step 17) instead of with_structured_output(): teaches the underlying mechanism so you can drop down when the high-level API breaks.
  • Subgraphs for specialists (Step 18) instead of node functions: independently testable, deployable, and replaceable; mirrors the monolith-to-services architectural shift.
  • Selective HITL (Step 14) instead of blanket interrupt_before: the production pattern; pause only on tools with side effects.
  • Custom RemoveMessage-based summarization (Step 16) instead of "just truncate the list": shows the LangGraph reducer model under the hood.
  • Two-tier cache with exact-then-semantic (Step 19) instead of one-tier or LangChain's built-in cache: demonstrates the production hierarchy and threshold-tuning discussion.
  • Tag-based event filtering for streaming (Step 20) instead of streaming everything: prevents leaking supervisor reasoning to users.
  • Eval harness with regression-detection script (Step 12) instead of one-shot evals: the CI-gate pattern.

Each is a small choice that signals familiarity with what production looks like, not just what tutorials show.
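
To illustrate the tag-based streaming choice, the sketch below filters internal events by tag and serializes the rest in the Server-Sent Events wire format (an `event:` line, a `data:` line, then a blank line). It is a simplified stand-in for the repo's `app/streaming.py`: the raw event dictionaries and the `user_visible` and `supervisor` tag names are hypothetical. Only the `token` and `cache_hit` event types are taken from this README.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Set

# Hypothetical tag marking events that are safe to show to end users.
USER_VISIBLE_TAGS: Set[str] = {"user_visible"}


def to_sse(event_type: str, data: Dict[str, Any]) -> str:
    """Serialize one event as an SSE frame: event line, data line, blank line."""
    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"


def stream_events(raw_events: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield SSE frames only for events carrying a user-visible tag."""
    for ev in raw_events:
        if USER_VISIBLE_TAGS & set(ev.get("tags", [])):
            yield to_sse(ev["type"], ev["data"])


# Made-up internal events, for illustration only.
raw = [
    {"type": "token", "tags": ["supervisor"], "data": {"text": "route to policy"}},
    {"type": "cache_hit", "tags": ["user_visible"], "data": {"tier": "exact"}},
    {"type": "token", "tags": ["user_visible"], "data": {"text": "Hello"}},
]

frames = list(stream_events(raw))
```

The supervisor's token event never reaches the client because it lacks the visible tag, so routing rationale stays server-side while specialist output streams through.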


Status

Built and tested against:

  • Python 3.13
  • LangChain, LangGraph, langchain-aws
  • Postgres 16, Redis 7
  • Claude Sonnet 4.5 via Bedrock (us.anthropic.claude-sonnet-4-5-20250929-v1:0)
  • Amazon Titan v2 embeddings

If you find a step that's drifted out of date, the troubleshooting section in each step is the first place to check.
