EpsteinFiles-RAG

A RAG pipeline implementation built on the 'Epstein Files 20K' dataset from Hugging Face (Teyler).


Dataset source

👉 https://huggingface.co/datasets/teyler/epstein-files-20k

Precomputed Embeddings

👉 https://huggingface.co/datasets/devankit7873/EpsteinFiles-Vector-Embeddings-ChromaDB


⚡ Quick Demo

Process 2M+ document lines → Get accurate, grounded answers in seconds

What it does:

  • Automatically cleans and reconstructs fragmented documents
  • Intelligently chunks documents while preserving context
  • Embeds everything into a searchable vector database
  • Retrieves diverse, relevant information using the Maximal Marginal Relevance (MMR) algorithm
  • Generates answers grounded solely in the retrieved context

🎯 Key Features

No Hallucinations - The model is instructed to answer only from the retrieved source documents
Intelligent Retrieval - MMR algorithm for diverse results
Fast Processing - ~1 second end-to-end query response
Semantic Understanding - Context-aware document chunking
REST API - Easy integration with other systems
Interactive UI - Streamlit web interface included
Scalable - Handles 100K+ document chunks
Production-Ready - Async support, error handling, logging


🏗️ How It Works

Three Simple Stages

Stage 1: Data Preparation

Raw Documents (2.5M lines)
    ↓
Clean & Reconstruct
    ↓
Smart Chunking
    ↓
Vector Embeddings
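
The README does not describe how cleaning and chunking are implemented, so the following is only a minimal sketch under stated assumptions: fragmented lines are merged into paragraphs at blank lines, and chunks are fixed-size word windows that overlap so context carries across boundaries. The names `reconstruct` and `chunk_words` and the default sizes are illustrative and are not taken from the repository.

```python
def reconstruct(lines):
    """Merge fragmented lines into paragraphs; a blank line ends a paragraph."""
    paragraphs, current = [], []
    for line in lines:
        text = " ".join(line.split())  # collapse runs of whitespace
        if text:
            current.append(text)
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def chunk_words(text, size=200, overlap=40):
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    paras = reconstruct(["The flight log", "lists several", "", "names."])
    for para in paras:
        print(chunk_words(para, size=4, overlap=2))
```

The overlap is a design trade-off: larger values preserve more context at chunk boundaries but increase the number of chunks to embed.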

Stage 2: Intelligent Retrieval

User Question
    ↓
Find Similar Context (MMR)
    ↓
Return Top Chunks

Stage 3: Grounded Answer

Context + Question
    ↓
LLaMA 3.3 LLM
    ↓
Grounded Answer (with sources)
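
To make the "Context + Question" step concrete, here is a hedged sketch of how a grounded prompt might be assembled before it is sent to the LLM. The function name `build_grounded_prompt`, the `source`/`text` keys, and the prompt wording are assumptions for illustration; the repository's actual prompt may differ.

```python
def build_grounded_prompt(question, chunks):
    """Build a prompt that restricts the model to the retrieved chunks.

    `chunks` is a list of dicts with hypothetical keys "source" and "text".
    """
    numbered = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer (cite sources as [n]):"
    )


if __name__ == "__main__":
    demo_chunks = [
        {"source": "doc_a", "text": "First retrieved passage."},
        {"source": "doc_b", "text": "Second retrieved passage."},
    ]
    print(build_grounded_prompt("What does the first passage say?", demo_chunks))
```

Numbering the chunks gives the model a simple way to cite sources, which is what lets the final answer be traced back to specific documents.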

Why MMR Instead of Similarity?

Previous Approach: Pure semantic similarity
→ Returned redundant chunks from the same document

Current Approach: Maximal Marginal Relevance (MMR)
→ Balances relevance and diversity for more comprehensive context
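
The trade-off above can be sketched in a few lines of plain Python. This is a standalone illustration of greedy MMR selection, not the repository's code (which likely relies on its vector store's built-in retrieval); the function names and the `lambda_mult` parameter name are assumptions. Each step picks the candidate maximizing `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen greedily by MMR."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 1.0]
    docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]  # docs 0 and 1 are near-duplicates
    print(mmr(query, docs, k=2, lambda_mult=0.5))  # balanced relevance and diversity
    print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure similarity ranking
```

With `lambda_mult=1.0` the procedure reduces to plain similarity ranking, which is the "previous approach" described above; lower values penalize near-duplicate chunks more heavily.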


📦 Installation

Requirements

Setup (5 minutes)

1. Clone repository

git clone https://github.com/AnkitNayak-eth/EpsteinFiles-RAG.git
cd EpsteinFiles-RAG

2. Create virtual environment

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Configure environment

Create a .env file (see .env.example for a template):

GROQ_API_KEY=your_api_key_here
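
Before starting the pipeline it can help to confirm the key is actually set. The snippet below is a small standard-library sketch for that check; the helper names `load_env` and `require_groq_key` are hypothetical, and the project itself may load the file differently (for example with a dotenv package).

```python
def load_env(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip().strip('"').strip("'")
    return loaded


def require_groq_key(env):
    """Return the Groq key, or fail loudly if it is missing or a placeholder."""
    key = env.get("GROQ_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError("GROQ_API_KEY is missing or still set to the placeholder")
    return key


if __name__ == "__main__":
    require_groq_key(load_env(".env"))
    print("GROQ_API_KEY found")
```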

🚀 Getting Started

Run Complete Pipeline (First Time)

This processes data and prepares the system for queries:

# Step 1: Download raw data (~5-15 min)
python ingest/download_dataset.py

# Step 2: Clean and reconstruct documents (~3-8 min)
python ingest/clean_dataset.py

# Step 3: Create semantic chunks (~5-12 min)
python ingest/chunk_dataset.py

# Step 4: Generate embeddings (~20-45 min)
python ingest/embed_chunks.py
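
If you prefer a single command, the four scripts can be chained from a small wrapper. This is a hedged convenience sketch, not a file shipped with the repository; it assumes it is run from the repository root and that each script exits non-zero on failure. The names `build_commands` and `run_pipeline` are illustrative.

```python
import subprocess
import sys

# Script paths as listed in this README, in the order they must run.
INGEST_SCRIPTS = [
    "ingest/download_dataset.py",
    "ingest/clean_dataset.py",
    "ingest/chunk_dataset.py",
    "ingest/embed_chunks.py",
]


def build_commands(python=sys.executable):
    """Return one argv list per ingest script, using the current interpreter."""
    return [[python, script] for script in INGEST_SCRIPTS]


def run_pipeline(runner=subprocess.run):
    """Run each script in order; stop at the first failure."""
    for cmd in build_commands():
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)  # subprocess.run raises CalledProcessError on non-zero exit


if __name__ == "__main__":
    run_pipeline()
```

Using `sys.executable` keeps the scripts inside the active virtual environment rather than whichever `python` happens to be first on the PATH.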

Start Using the System

Terminal 1 - Start API Server

uvicorn api.main:app --reload

API runs at: http://127.0.0.1:8000

Terminal 2 - Start Web UI

streamlit run app.py

UI opens at: http://localhost:8501

That's it! You can now query through the web interface or API.
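
For programmatic access, a request to the API might look like the sketch below. The README does not document the routes, so the `/ask` path and the `{"question": ...}` JSON body are assumptions; check `api/main.py` for the real endpoint and schema before relying on this.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8000"  # address printed by uvicorn above


def build_query_request(question, path="/ask"):
    """Build a JSON POST request. The "/ask" path and body shape are hypothetical."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        API_BASE + path,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, timeout=30):
    """Send the question to the running API server and decode the JSON reply."""
    request = build_query_request(question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the API server from Terminal 1 to be running.
    print(ask("Which documents mention flight logs?"))
```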


📚 Project Structure

EpsteinFiles-RAG/
├── ingest/                    # Data processing pipeline
│   ├── download_dataset.py    # Download from Hugging Face
│   ├── clean_dataset.py       # Clean & reconstruct docs
│   ├── chunk_dataset.py       # Semantic chunking
│   └── embed_chunks.py        # Embed & index
├── api/                       # FastAPI backend
│   ├── main.py                # API routes
│   └── models.py              # Data models
├── app.py                     # Streamlit UI
├── requirements.txt           # Python dependencies
├── .env.example               # Environment template
└── README.md                  # This file

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments


📞 Support

Built by: Ankit Kumar Nayak
Full-Stack Developer | AI & RAG Systems

Get Help:


⚠️ Disclaimer

This project is built for research, transparency, and educational purposes. All data is sourced from public records. Users are responsible for complying with applicable laws and ethical guidelines when using this system.

