BM25 retrieval service with a prebuilt JSON index, FastAPI endpoints, and basic latency metrics. Dev API key guard; no vectors, reranking, eval suite, or rate limiting (yet).
- Acceptance Criteria
 - Quickstart
 - Endpoints
 - Usage (auth required)
 - Evaluate
 - Benchmarks (local demo)
 - Architecture
 - Deployment (Docker & one-click)
 - Monitoring & Metrics
 - Security (API key & rate limiting)
 - Troubleshooting
 - Notes
 
- Recall@10 ≥ 0.80; Answer F1 ≥ 0.70 (or EM ≥ 0.60)
 - p95 latency ≤ 800 ms (≥100 queries); p50 ≤ 300 ms
 - Cost/1k queries within budget; cache hit-rate ≥ 30%
 - API-key auth + rate limiting
 - Docker + one-click deploy (Render/Fly/Cloud Run)
 - README benchmarks table + Loom demo
 
conda create -n rag_env python=3.11 -y
conda activate rag_env
pip install -r requirements.txt
# build the BM25 index
python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json
# run (auth + rate limit)
API_KEY=dev-key RATE_LIMIT_PER_MIN=30 \
python -m uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

# alternative setup with venv instead of conda
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# build the BM25 index
python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json
# run (auth + rate limit)
API_KEY=dev-key RATE_LIMIT_PER_MIN=30 \
uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

# one-liner: build the index, then run
# (the env vars must prefix the uvicorn command; placed before the index build they would not reach the server)
python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json && \
API_KEY=dev-key RATE_LIMIT_PER_MIN=30 \
python -m uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

- GET /health → {"ok": true, "version": "..."}
 - GET /version → {"version": "..."}
 - GET /metrics → {"requests": n, "latency_ms_p50": ..., "latency_ms_p95": ..., "window": n}
 - POST /ask → {"answer": "...", "latency_ms": 0.0, "docs": [{"doc_id": "...", "text": "...", "score": ...}]}

Set your base URL and API key (local example shown):
export BASE_URL=http://localhost:8010
export API_KEY=dev-key

Authorized request (200):
curl -s -X POST "$BASE_URL/ask" \
  -H "x-api-key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}' | python -m json.tool

Unauthorized example (should be 401):
curl -i -s -X POST "$BASE_URL/ask" \
  -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}'

Metrics:
curl -s "$BASE_URL/metrics" | python -m json.tool

Place 50 Q/A pairs in eval/gold.jsonl (one JSON object per line):
{"question":"...", "answer":"..."}

Run the evaluator:
python -m eval.evaluate --gold ./eval/gold.jsonl --api http://localhost:8010/ask --k 5

| Metric | Value |
|---|---|
| Answer F1 | 1.00 (toy) | 
| Recall@10 | 1.00 (toy) | 
| p50 latency | 0.199 ms (local) | 
| p95 latency | 0.355 ms (local) | 
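
For readers who want to sanity-check the Answer F1 and Recall@10 rows above, here is a minimal sketch of how such metrics are commonly computed. It is not taken from eval/evaluate.py, whose implementation is not shown here. The function names `token_f1` and `recall_at_k` are hypothetical, tokenization is plain whitespace splitting, and Recall@k assumes you have a set of relevant doc IDs per question (the gold.jsonl format shown above carries only question and answer).

```python
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Assumption: lowercase + whitespace tokenization; real evaluators often
    # also strip punctuation and articles.
    return text.lower().split()


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer."""
    pred, ref = _tokens(prediction), _tokens(gold)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant doc IDs that appear in the top-k retrieved IDs."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)
```

Averaging `token_f1` over all gold pairs gives a corpus-level score comparable to the table row. With a toy corpus a perfect 1.00 is expected and says little about real-world quality.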
flowchart TD
  A[Client] -->|HTTP| S[FastAPI Service]
  subgraph Routes
    S --> R1["POST /ask"]
    S --> R2["GET /health"]
    S --> R3["GET /version"]
    S --> R4["GET /metrics"]
  end
  R1 --> G{API key valid?}
  G -- No --> E401[401 Unauthorized]
  G -- Yes --> L{Within rate limit?}
  L -- No --> E429[429 Rate Limited]
  L -- Yes --> Q[BM25 query k]
  Q --> IDX[Index JSON]
  Q --> B[Stopword-aware boost]
  B --> SSEL[Best sentence]
  SSEL --> CAN[Canonical phrasing]
  CAN --> RESP[Response: answer/docs/latency_ms]
  R1 -. on success .-> MREC[Record latency]
  MREC --> R4
 
  subgraph Build
    C1[corpus txt files] --> IDX
    C2[python -m rag_app.index] --> IDX
  end

- API layer: rag_app/main.py (FastAPI app, routes, request/response models).
 - Auth: Simple API key via the x-api-key header. Disabled if the API_KEY env var is unset or empty.
 - Rate limiting: In-memory token bucket per key (RATE_LIMIT_PER_MIN), thread-safe.
 - Retrieval: rag_app/retrieval.py with BM25Retriever over a JSON index.
 - Index build: rag_app/index.py splits corpus/*.txt into snippets → writes rag_app/index.json.
 - Answering: Stopword-aware boost, choose best sentence from top snippet, then optional canonical phrasing for known intents.
 - Metrics: In-memory deque of recent latencies (p50/p95) + request count, exposed at /metrics.
 - (Optional) Cache: Small in-memory LRU for repeated (question, k) lookups.

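
To make the retrieval component above more concrete, the following is a small, dependency-free sketch of BM25 scoring. It is an illustration only: the real BM25Retriever in rag_app/retrieval.py is not shown in this README, and this sketch uses the non-negative (Lucene-style) IDF variant, which may differ from what the service or rank_bm25 uses. The function name `bm25_scores` and the defaults `k1=1.5`, `b=0.75` are assumptions.

```python
import math
from collections import Counter


def bm25_scores(
    query: list[str],
    docs: list[list[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> list[float]:
    """Score each tokenized doc against a tokenized query with BM25."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n or 1.0  # avoid dividing by zero
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Lucene-style IDF: the +1 inside the log keeps it non-negative.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

Sorting document indices by these scores and keeping the first k would correspond to the "BM25 query k" step in the diagram.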
- Guard: Check x-api-key (if API_KEY is set) and rate limit the caller.
 - Retrieve: Query BM25 over rag_app/index.json (top-k).
 - Re-rank: Apply stopword-aware term-match boost to prioritize relevant snippets.
 - Answer pick: Choose the best sentence from the top snippet; if the question matches a known intent, apply canonical phrasing.
 - Metrics: Record latency (ms) into a rolling window (default 5k requests).
 - Respond: Return {answer, latency_ms, docs}.

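
The re-rank and answer-pick steps above can be sketched roughly as follows. This is a guess at the general shape, not the code in rag_app/main.py: the stopword list, the `weight` parameter, the regex tokenizer, and the function names `boost` and `best_sentence` are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "of", "to", "in", "and", "for"}


def content_terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def boost(question: str, docs: list[dict], weight: float = 0.5) -> list[dict]:
    """Add a bonus per matching non-stopword term, then re-sort by score."""
    q = content_terms(question)
    rescored = []
    for d in docs:
        matches = len(q & content_terms(d["text"]))
        rescored.append({**d, "score": d["score"] + weight * matches})
    return sorted(rescored, key=lambda d: d["score"], reverse=True)


def best_sentence(question: str, snippet: str) -> str:
    """Pick the sentence sharing the most non-stopword terms with the question."""
    q = content_terms(question)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", snippet) if s.strip()]
    return max(sentences, key=lambda s: len(q & content_terms(s)), default="")
```

Ignoring stopwords keeps words like "what" and "is" from inflating the match count, so the boost rewards only terms that carry meaning.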
- Corpus: Plain text files under corpus/. Edit or replace for your domain.
 - Index artifact: rag_app/index.json (generated). Treat as a build artifact; ignore in git.
   - Build at image build time (Docker) or at container start if missing.
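
As a rough picture of what an index build like this involves, here is a minimal sketch that splits each corpus file into paragraph snippets and writes them to a JSON file. The actual schema of rag_app/index.json and the splitting rule used by rag_app/index.py are not documented here, so the record fields (`doc_id`, `text`), the blank-line paragraph split, and the function name `build_index` are assumptions.

```python
import json
from pathlib import Path


def build_index(corpus_dir: str, out_path: str) -> int:
    """Split corpus/*.txt into paragraph snippets and write them as JSON.

    Returns the number of snippets written.
    """
    records = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # Assumption: one snippet per blank-line-separated paragraph.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for i, para in enumerate(paragraphs):
            records.append({"doc_id": f"{path.stem}-{i}", "text": para})
    Path(out_path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(records)
```

Because the output is fully derived from corpus/, regenerating it in the Docker build (or on first start) keeps the artifact out of version control, as noted above.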
 
 
- API_KEY – enables auth when set (e.g., dev-key for local).
 - RATE_LIMIT_PER_MIN – integer per-key budget (default 60).
 - (If you add caching) CACHE_TTL_S, CACHE_MAX.

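
If you do add caching, the two cache variables above could drive something like the sketch below: an LRU cache whose entries also expire after a TTL. This is one possible design, not existing project code. The class name `TTLCache`, the fallback defaults (256 entries, 60 seconds), and the injectable `clock` parameter are hypothetical, and the sketch is not thread-safe (wrap calls in a lock if handlers run in multiple threads).

```python
import os
import time
from collections import OrderedDict


class TTLCache:
    """LRU cache with per-entry expiry, keyed e.g. by (question, k)."""

    def __init__(self, max_items: int, ttl_s: float, clock=time.monotonic):
        self.max_items = max_items
        self.ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # drop the stale entry
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value) -> None:
        self._data[key] = (self._clock() + self.ttl_s, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict the least recently used


# Assumed defaults; the README does not specify values for these variables.
cache = TTLCache(
    max_items=int(os.getenv("CACHE_MAX", "256")),
    ttl_s=float(os.getenv("CACHE_TTL_S", "60")),
)
```

A handler would call `cache.get((question, k))` before querying BM25 and `cache.put((question, k), response)` afterwards.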
rag_app/
├─ main.py         # FastAPI app, routes, auth, limiter, metrics, answering
├─ retrieval.py    # BM25Retriever (loads/snaps index)
├─ index.py        # builds JSON index from corpus/*.txt
└─ index.json      # generated artifact (ignored in VCS)
eval/
└─ evaluate.py     # computes F1/Recall@k via API calls
corpus/
└─ *.txt           # domain text
Uses Dockerfile. Set API_KEY in Render env vars after deploy.
# build
docker build -t rag-service .
# run (maps host port 8000 to container port 8000)
docker run --rm -p 8000:8000 \
  -e API_KEY=dev-key \
  -e RATE_LIMIT_PER_MIN=60 \
  rag-service

Set BASE_URL=http://localhost:8000 when testing the container.
GET /metrics → JSON: { "requests": 42, "latency_ms_p50": 1.23, "latency_ms_p95": 3.45, "window": 42, "version": "..." }
- window = number of recent requests kept in memory (rolling window).
 - Values reset on process restart (in-memory).
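
One way to produce numbers like these is a bounded deque plus a nearest-rank percentile, sketched below. This is an illustration under assumptions, not the service's code: the class name `LatencyWindow` and the nearest-rank method are choices made for the example, and the default of 5000 samples mirrors the "5k requests" window mentioned earlier. The sketch is not thread-safe on its own.

```python
import math
from collections import deque


class LatencyWindow:
    """Keeps the most recent latencies and reports p50/p95 over them."""

    def __init__(self, maxlen: int = 5000):
        self._samples = deque(maxlen=maxlen)  # oldest samples fall off automatically
        self.requests = 0  # total since process start, not limited by the window

    def record(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)
        self.requests += 1

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile; returns 0.0 when no samples exist."""
        if not self._samples:
            return 0.0
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    def snapshot(self) -> dict:
        return {
            "requests": self.requests,
            "latency_ms_p50": self.percentile(50),
            "latency_ms_p95": self.percentile(95),
            "window": len(self._samples),
        }
```

Sorting on every read is fine for a 5k window; larger windows might call for a streaming quantile estimator instead.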
 
# Pretty print
curl -s "$BASE_URL/metrics" | python -m json.tool
# Print just key numbers (quote-safe)
curl -s "$BASE_URL/metrics" \
| python -c 'import sys,json; d=json.load(sys.stdin); print("requests={}  p50={} ms  p95={} ms".format(d["requests"], d["latency_ms_p50"], d["latency_ms_p95"]))'

# optional: Prometheus exposition
pip install prometheus-fastapi-instrumentator

# rag_app/main.py
from prometheus_fastapi_instrumentator import Instrumentator

# Instrument at import time; Starlette rejects new middleware once the app has started.
instrumentator = Instrumentator().instrument(app)

@app.on_event("startup")
def _startup():
    instrumentator.expose(app, endpoint="/metrics/prom")

- Header: x-api-key: <YOUR_KEY>
 - Enabled when API_KEY env var is set (any non-empty string).
 - Disabled in dev if API_KEY is empty.

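
The enable/disable rule above can be expressed as a small, framework-independent check, sketched here. It is not the guard from rag_app/main.py: the function name `is_authorized` and the `expected` override are hypothetical, and the constant-time comparison via `hmac.compare_digest` is an addition for illustration rather than something the README states the service does.

```python
import hmac
import os
from typing import Optional


def is_authorized(header_value: Optional[str], expected: Optional[str] = None) -> bool:
    """Return True if the request may proceed.

    `expected` defaults to the API_KEY env var; passing it explicitly is
    only a convenience for testing.
    """
    if expected is None:
        expected = os.getenv("API_KEY", "")
    if not expected:
        return True  # auth is disabled when API_KEY is unset or empty
    if not header_value:
        return False  # key required but the x-api-key header is missing
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(header_value.encode(), expected.encode())
```

In a FastAPI route, a `False` result would typically translate into `HTTPException(status_code=401)`.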
Examples
# Authorized (200)
curl -s -X POST "$BASE_URL/ask" \
  -H "x-api-key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}'
# Unauthorized (401)
curl -i -s -X POST "$BASE_URL/ask" \
  -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}'

- In-memory token bucket per API key.
 - Budget per minute: RATE_LIMIT_PER_MIN (default 60).
 - Exceeds budget → 429 Too Many Requests.
 - For multi-replica deployments, move buckets to Redis (shared state).
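
For reference, a per-key token bucket along the lines described above might look like the following sketch. It is not the limiter from rag_app/main.py: the class name `TokenBucket`, the choice to start each key with a full bucket, and the injectable `clock` are assumptions. State lives in a plain dict, which is exactly why multiple replicas would need shared storage such as Redis.

```python
import threading
import time


class TokenBucket:
    """Per-key token bucket: `per_minute` tokens of capacity, refilled continuously."""

    def __init__(self, per_minute: int, clock=time.monotonic):
        self.capacity = float(per_minute)
        self.refill_per_s = per_minute / 60.0
        self._clock = clock  # injectable so tests need not sleep
        self._lock = threading.Lock()
        self._buckets = {}  # key -> (tokens, last_refill_timestamp)

    def allow(self, key: str) -> bool:
        """Consume one token for `key`; False means respond with 429."""
        with self._lock:
            now = self._clock()
            # Assumption: unseen keys start with a full bucket.
            tokens, last = self._buckets.get(key, (self.capacity, now))
            tokens = min(self.capacity, tokens + (now - last) * self.refill_per_s)
            allowed = tokens >= 1.0
            if allowed:
                tokens -= 1.0
            self._buckets[key] = (tokens, now)
            return allowed
```

A bucket permits short bursts up to the full budget while still holding the long-run rate to RATE_LIMIT_PER_MIN, which tends to be friendlier to clients than a fixed window counter.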
 
Set limits
API_KEY=<strong-secret> RATE_LIMIT_PER_MIN=60 \
uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

Best practices
- Use different keys per environment (dev/stage/prod).
 - Rotate keys; never commit them.
 - Front with a gateway/WAF if exposed publicly.
 - Add CORS policy if you’ll call from a browser app.
 
| Symptom | Likely cause | Fix |
|---|---|---|
| 401 Unauthorized on /ask | Missing/incorrect x-api-key or API_KEY not set on server | Set API_KEY server-side and send x-api-key header. Test curl -s $BASE_URL/health. |
| 429 Too Many Requests | Rate limit exceeded | Lower request rate, increase RATE_LIMIT_PER_MIN, or use separate keys for tests. |
| 404 Not Found on /version or /ask | Wrong app path or port | Ensure you run rag_app.main:app and target the right port. List paths via /openapi.json. |
| Port already in use | Old server still running | ss -lptn 'sport = :8010' then kill the PID, or change --port. |
| uvicorn: command not found | Not installed in current env | pip install uvicorn[standard]; confirm with which python / which uvicorn. |
| ModuleNotFoundError: rag_app | Wrong cwd / PYTHONPATH | Run from repo root or set PYTHONPATH=.; uvicorn rag_app.main:app ... |
| Index missing at startup | rag_app/index.json not built | Run python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json. |
| /metrics shows zeros | Fresh process or no traffic | Send a few /ask requests, then recheck. |
| JSON errors in CLI snippets | F-string quoting | Use the .format() example in Monitoring section. |
| Docker healthcheck failing | Wrong port or env | Container listens on $PORT (default 8000). Map and set API_KEY. |

Diagnostics
# List routes
curl -s "$BASE_URL/openapi.json" | python -m json.tool
# Health/version
curl -s "$BASE_URL/health"; curl -s "$BASE_URL/version"
# Minimal POST
curl -s -X POST "$BASE_URL/ask" \
  -H "x-api-key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":3}' | python -m json.tool

- Start with BM25 baseline (rank_bm25), then add vectors + reranker as needed.
 - Consider a small LRU cache for repeated queries and structured logging for observability.
 
MIT — see LICENSE.
Questions? Open an issue or ping me on LinkedIn.