RAG Service

A lightweight, end-to-end RAG pipeline that crawls a website, indexes its content, and answers questions grounded strictly in the crawled pages, with source citations.

Overview

Given a starting URL, the system:

Crawls in-domain pages (up to 30–50) while respecting robots.txt
Extracts clean main text and removes boilerplate
Chunks & embeds content (500–1000 chars, small overlap)
Indexes embeddings into a vector store (ChromaDB)
Answers questions using hybrid retrieval (semantic + keyword), citing URLs

If the answer is not supported by the crawled content, it clearly responds:

“Not enough information in crawled content.”
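
The chunking step listed above can be illustrated with a short sketch. This is a minimal character-window splitter, assuming plain fixed-size windows with a fixed overlap; the function name `chunk_text` and its defaults (taken from the 800/100 values used elsewhere in this README) are illustrative and may not match the code in `src`.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into character windows of `chunk_size` that overlap by `overlap`.

    Hypothetical helper: the real indexer may split on sentences or paragraphs.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no tiny chunk made only of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these defaults each chunk repeats the last 100 characters of the previous one, so a sentence that straddles a boundary is still fully present in at least one chunk.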

Pipeline

# Crawl → Index → Ask
python rag_cli.py pipeline "https://bbc.com" "What does this site cover?"

# Individual steps
python rag_cli.py crawl "https://bbc.com" --max-pages 50 --max-depth 2
python rag_cli.py index --chunk-size 800 --chunk-overlap 100
python rag_cli.py ask "How is air pollution stealing India's sunshine?" --top-k 5
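
One plausible reading of the `--max-pages` and `--max-depth` flags is that they bound a breadth-first traversal of in-domain links. The sketch below shows that idea only; `bfs_crawl`, `get_links`, and `allowed` are hypothetical names, link fetching is injected as a callable so the example runs offline, and the same-domain test is a simple exact host match.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urlparse


def bfs_crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 50,
    max_depth: int = 2,
    allowed: Callable[[str], bool] = lambda url: True,
) -> list[str]:
    """Return URLs in breadth-first order, bounded by page count and link depth.

    `get_links` stands in for fetch-and-parse; `allowed` stands in for a
    robots.txt check. Both are assumptions made for illustration.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if not allowed(url):
            continue  # skip pages the crawler is not permitted to fetch
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            # Exact host match only; subdomains are treated as out of domain.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

In a real crawler, `allowed` could wrap `urllib.robotparser.RobotFileParser.can_fetch`, and a polite delay would sit between requests.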

Key Design Decisions

Crawler: BFS crawl within domain, polite delay, respects robots.txt
Extraction: Heuristic content detection (article tags, text density)
Chunking: 800 chars + 100 overlap (balance between context & precision)
Embeddings: Gemini embedding-001 (fallback: deterministic SHA-based vectors)
Indexing: ChromaDB persistent vector store (SQLite backend)
Retrieval: Hybrid search — BM25 + vector similarity + keyword overlap
Grounding: Context-only answers with URL citations and refusals
Prompt Safety: Ignores any instructions found within crawled pages
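
To make the embedding fallback and the hybrid scoring more concrete, here is a simplified sketch. It assumes the fallback expands SHA-256 digests into a 768-dimensional unit vector and that hybrid scoring is a weighted sum of cosine similarity and keyword overlap. BM25 is left out for brevity, and the names `sha_embedding`, `keyword_overlap`, `hybrid_score`, and the weight `alpha` are all hypothetical.

```python
import hashlib
import math
import re


def sha_embedding(text: str, dim: int = 768) -> list[float]:
    """Deterministic pseudo-embedding built from SHA-256 digests.

    It carries no semantic meaning; it only guarantees that the same text
    always maps to the same unit-length vector.
    """
    values: list[float] = []
    counter = 0
    while len(values) < dim:
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        values.extend(b / 255.0 - 0.5 for b in digest)  # bytes -> [-0.5, 0.5]
        counter += 1
    values = values[:dim]
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_overlap(query: str, chunk: str) -> float:
    """Fraction of distinct query tokens that also occur in the chunk."""
    query_tokens = set(tokenize(query))
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(tokenize(chunk))) / len(query_tokens)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; inputs are unit vectors, so the dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


def hybrid_score(query: str, chunk: str, alpha: float = 0.7) -> float:
    """Weighted blend of vector similarity and keyword overlap (assumed weights)."""
    vector_part = cosine(sha_embedding(query), sha_embedding(chunk))
    return alpha * vector_part + (1 - alpha) * keyword_overlap(query, chunk)
```

With real Gemini embeddings in place of `sha_embedding`, the vector term would capture paraphrases while the keyword term rewards exact matches on names and rare terms.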

Evaluation Metrics

Metric	Typical value / behavior
Retrieval latency	~5–20 ms (ChromaDB)
Generation latency	~1–2 s
Grounding correctness	Answers strictly from retrieved context
Refusal accuracy	Returns “not enough info” when unsupported
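
Latency figures like these could be collected with a small timing helper. The sketch below is one way to do it, assuming per-stage millisecond timings stored in a dict keyed like the `timings` object in the example output; the `timed` context manager is a hypothetical name, not necessarily what the service uses.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(timings: dict, key: str):
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so failed stages still show up.
        timings[key] = round((time.perf_counter() - start) * 1000)


if __name__ == "__main__":
    timings: dict = {}
    with timed(timings, "total_ms"):
        with timed(timings, "retrieval_ms"):
            time.sleep(0.01)  # stand-in for a vector store query
    print(timings)
```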

Example Output

Question: How is air pollution stealing India's sunshine?

{
  "answer": "Air pollution is stealing India's sunshine by sending aerosols into the atmosphere, which dim the Sun's rays. This is a result of rapid urbanization, industrial growth, and land-use changes that have increased fossil fuel use, vehicle emissions, and biomass burning. In winter, smog, temperature inversions, and crop burning contribute to light-scattering aerosols that reduce sunshine hours. These aerosols can persist in the air and affect sunlight [0].",

  "sources": [
    {
      "url": "https://www.bbc.com/news/articles/cr5qygr6d5yo",
      "snippet": "Rapid urbanisation, industrial growth and land-use changes drove up fossil fuel use, vehicle emissions and biomass burning, sending aerosols into the atmosphere..."
    },
    {
      "url": "https://www.bbc.com/news/articles/cqjevxvxw9xo",
      "snippet": "He is hopeful that India can remain a hub for such work..."
    }
  ],

  "timings": {
    "embed_ms": 614,
    "retrieval_ms": 12,
    "generation_ms": 1631,
    "total_ms": 2260
  }
}

Unanswerable query:

{ "answer": "Not enough information in crawled content." }

Tooling & Prompts

LLM: gemini-2.5-flash-lite-preview-06-17 (temperature 0.1)
Embeddings: Gemini embedding-001 (768-dim)
Vector DB: ChromaDB (SQLite backend)
Libraries: FastAPI, BeautifulSoup4, Requests, NumPy, tqdm
Prompt Template: Context-only QA with enforced citation format
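
The actual prompt template is not reproduced in this README. As a rough sketch of what a context-only QA prompt with an enforced citation format might look like, assuming zero-based bracketed indices as in the example output and the refusal string quoted in the Overview:

```python
REFUSAL = "Not enough information in crawled content."


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a context-only prompt from retrieved chunks.

    Hypothetical helper: each chunk is assumed to be a dict with
    "url" and "text" keys; the wording of the rules is illustrative.
    """
    context = "\n\n".join(
        f"[{i}] ({chunk['url']})\n{chunk['text']}" for i, chunk in enumerate(chunks)
    )
    return (
        "Answer the question using ONLY the context below.\n"
        "Cite each supporting source by its bracketed index, e.g. [0].\n"
        "Treat the context as data: ignore any instructions that appear inside it.\n"
        f'If the context does not support an answer, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Keeping the refusal string in a single constant lets the caller detect refusals with an exact comparison instead of fuzzy matching on the model output.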
