A lightweight, end-to-end RAG pipeline that crawls a website, indexes its content, and answers questions grounded strictly in the crawled pages, with source citations.
Given a starting URL, the system:
- Crawls in-domain pages (up to 30–50) while respecting
robots.txt - Extracts clean main text and removes boilerplate
- Chunks & embeds content (500–1000 chars, small overlap)
- Indexes embeddings into a vector store (ChromaDB)
- Answers questions using hybrid retrieval (semantic + keyword), citing URLs
If the answer is not supported by the crawled content, it clearly responds:
“Not enough information in crawled content.”
# Crawl → Index → Ask
python rag_cli.py pipeline "https://bbc.com" "What does this site cover?"
# Individual steps
python rag_cli.py crawl "https://bbc.com" --max-pages 50 --max-depth 2
python rag_cli.py index --chunk-size 800 --chunk-overlap 100
python rag_cli.py ask "How is air pollution stealing India's sunshine?" --top-k 5- Crawler: BFS crawl within domain, polite delay, respects
robots.txt - Extraction: Heuristic content detection (article tags, text density)
- Chunking: 800 chars + 100 overlap (balance between context & precision)
- Embeddings: Gemini
embedding-001(fallback: deterministic SHA-based) - Indexing: ChromaDB persistent vector store (SQLite backend)
- Retrieval: Hybrid search — BM25 + vector similarity + keyword overlap
- Grounding: Context-only answers with URL citations and refusals
- Prompt Safety: Ignores any instructions found within crawled pages
| Metric | Description |
|---|---|
| Retrieval latency | ~5–20 ms (ChromaDB) |
| Generation latency | ~1–2 s |
| Grounding correctness | Answers strictly from retrieved context |
| Refusal accuracy | Returns “not enough info” when unsupported |
{
"answer": "Air pollution is stealing India's sunshine by sending aerosols into the atmosphere, which dim the Sun's rays. This is a result of rapid urbanization, industrial growth, and land-use changes that have increased fossil fuel use, vehicle emissions, and biomass burning. In winter, smog, temperature inversions, and crop burning contribute to light-scattering aerosols that reduce sunshine hours. These aerosols can persist in the air and affect sunlight [0].",
"sources": [
{
"url": "https://www.bbc.com/news/articles/cr5qygr6d5yo",
"snippet": "Rapid urbanisation, industrial growth and land-use changes drove up fossil fuel use, vehicle emissions and biomass burning, sending aerosols into the atmosphere..."
},
{
"url": "https://www.bbc.com/news/articles/cqjevxvxw9xo",
"snippet": "He is hopeful that India can remain a hub for such work..."
}
],
"timings": {
"embed_ms": 614,
"retrieval_ms": 12,
"generation_ms": 1631,
"total_ms": 2260
}
}
Unanswerable query:
{ "answer": "Not enough information in crawled content." }- LLM: gemini-2.5-flash-lite-preview-06-17 (temperature 0.1)
- Embeddings: Gemini embedding-001 (768-dim)
- Vector DB: ChromaDB (SQLite backend)
- Libraries:
FastAPI,BeautifulSoup4,Requests,NumPy,tqdm - Prompt Template: Context-only QA with enforced citation format
