Skip to content

J-Karthikeyan/rag-service

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RAG Service

A lightweight, end-to-end RAG pipeline that crawls a website, indexes its content, and answers questions grounded strictly in the crawled pages, with source citations.

Overview

Given a starting URL, the system:

  1. Crawls in-domain pages (up to 30–50) while respecting robots.txt
  2. Extracts clean main text and removes boilerplate
  3. Chunks & embeds content (500–1000 chars, small overlap)
  4. Indexes embeddings into a vector store (ChromaDB)
  5. Answers questions using hybrid retrieval (semantic + keyword), citing URLs

If the answer is not supported by the crawled content, it clearly responds:

“Not enough information in crawled content.”

Pipeline

# Crawl → Index → Ask
python rag_cli.py pipeline "https://bbc.com" "What does this site cover?"

# Individual steps
python rag_cli.py crawl "https://bbc.com" --max-pages 50 --max-depth 2
python rag_cli.py index --chunk-size 800 --chunk-overlap 100
python rag_cli.py ask "How is air pollution stealing India's sunshine?" --top-k 5

Key Design Decisions

  • Crawler: BFS crawl within domain, polite delay, respects robots.txt
  • Extraction: Heuristic content detection (article tags, text density)
  • Chunking: 800 chars + 100 overlap (balance between context & precision)
  • Embeddings: Gemini embedding-001 (fallback: deterministic SHA-based)
  • Indexing: ChromaDB persistent vector store (SQLite backend)
  • Retrieval: Hybrid search — BM25 + vector similarity + keyword overlap
  • Grounding: Context-only answers with URL citations and refusals
  • Prompt Safety: Ignores any instructions found within crawled pages

Evaluation Metrics

Metric Description
Retrieval latency ~5–20 ms (ChromaDB)
Generation latency ~1–2 s
Grounding correctness Answers strictly from retrieved context
Refusal accuracy Returns “not enough info” when unsupported

Example Output

Question: How is air pollution stealing india's sunshine ?

{
  "answer": "Air pollution is stealing India's sunshine by sending aerosols into the atmosphere, which dim the Sun's rays. This is a result of rapid urbanization, industrial growth, and land-use changes that have increased fossil fuel use, vehicle emissions, and biomass burning. In winter, smog, temperature inversions, and crop burning contribute to light-scattering aerosols that reduce sunshine hours. These aerosols can persist in the air and affect sunlight [0].",

  "sources": [
    {
      "url": "https://www.bbc.com/news/articles/cr5qygr6d5yo",
      "snippet": "Rapid urbanisation, industrial growth and land-use changes drove up fossil fuel use, vehicle emissions and biomass burning, sending aerosols into the atmosphere..."
    },
    {
      "url": "https://www.bbc.com/news/articles/cqjevxvxw9xo",
      "snippet": "He is hopeful that India can remain a hub for such work..."
    }
  ],

  "timings": {
    "embed_ms": 614,
    "retrieval_ms": 12,
    "generation_ms": 1631,
    "total_ms": 2260
  }
}

alt text

Unanswerable query:

{ "answer": "Not enough information in crawled content." }

Tooling & Prompts

  • LLM: gemini-2.5-flash-lite-preview-06-17 (temperature 0.1)
  • Embeddings: Gemini embedding-001 (768-dim)
  • Vector DB: ChromaDB (SQLite backend)
  • Libraries: FastAPI, BeautifulSoup4, Requests, NumPy, tqdm
  • Prompt Template: Context-only QA with enforced citation format

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages