A RAG pipeline implementation built on the 'Epstein Files 20K' dataset from Hugging Face (Teyler).
👉 https://huggingface.co/datasets/teyler/epstein-files-20k
👉 https://huggingface.co/datasets/devankit7873/EpsteinFiles-Vector-Embeddings-ChromaDB
Process 2M+ document lines → Get accurate, grounded answers in seconds
What it does:
- Automatically cleans and reconstructs fragmented documents
- Intelligently chunks documents while preserving context
- Embeds everything into a searchable vector database
- Retrieves diverse, relevant information using MMR algorithm
- Generates answers grounded solely in the retrieved context
✅ No Hallucinations - Answers only from source documents
✅ Intelligent Retrieval - MMR algorithm for diverse results
✅ Fast Processing - ~1 second end-to-end query response
✅ Semantic Understanding - Context-aware document chunking
✅ REST API - Easy integration with other systems
✅ Interactive UI - Streamlit web interface included
✅ Scalable - Handles 100K+ document chunks
✅ Production-Ready - Async support, error handling, logging
Stage 1: Data Preparation
Raw Documents (2.5M lines)
↓
Clean & Reconstruct
↓
Smart Chunking
↓
Vector Embeddings
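The repository's chunker (ingest/chunk_dataset.py) isn't reproduced here; as a minimal sketch, context-preserving chunking typically keeps an overlap between consecutive chunks so sentences spanning a boundary survive in at least one chunk (chunk_size and overlap values below are illustrative, not the project's actual settings):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so context survives chunk boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # final window reached the end of the text
    return chunks

doc = "x" * 1200
chunks = chunk_text(doc, chunk_size=500, overlap=100)
print(len(chunks))  # 3 chunks: [0:500], [400:900], [800:1200]
```

Production pipelines often use a library splitter (e.g. LangChain's RecursiveCharacterTextSplitter) that additionally prefers paragraph and sentence boundaries; the overlap idea is the same.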
Stage 2: Intelligent Retrieval
User Question
↓
Find Similar Context (MMR)
↓
Return Top Chunks
Stage 3: Grounded Answer
Context + Question
↓
LLaMA 3.3 LLM
↓
Grounded Answer (with sources)
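The project's actual prompt template isn't shown in this README; as a sketch, grounding usually means assembling the retrieved chunks and the question into a prompt that explicitly forbids answering from outside the context (the wording and source-labeling scheme below are illustrative):

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that instructs the LLM to answer only from the
    retrieved context, with numbered source labels for citation."""
    context = "\n\n".join(f"[Source {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_grounded_prompt(
    "When was the document filed?",
    ["Filed on March 3, 1998.", "Case No. 98-123."],
)
print(prompt)
```

The resulting string would then be sent to the LLM (here, Llama 3.3 via Groq); the numbered source labels let the model cite which chunk supports each claim.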
Previous Approach: Pure semantic similarity
→ Returned redundant chunks from same document
Current Approach: Maximal Marginal Relevance (MMR)
→ Balances relevance + diversity for comprehensive context
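Conceptually, MMR greedily scores each remaining candidate as λ·relevance − (1 − λ)·redundancy, where redundancy is the candidate's maximum similarity to anything already selected. A minimal pure-Python sketch with toy 2-D embeddings (the λ values and vectors are illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def mmr_select(query, docs, k=2, lam=0.5):
    """Greedily pick k docs, trading off query relevance against
    redundancy with the already-selected docs."""
    selected, remaining = [], list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates
print(mmr_select(query, docs, k=2, lam=0.4))  # [0, 2] - diversity penalizes the duplicate
print(mmr_select(query, docs, k=2, lam=1.0))  # [0, 1] - pure relevance picks both duplicates
```

In LangChain, this is typically enabled on the vector store with `db.as_retriever(search_type="mmr")`, which is likely how the project wires it to Chroma.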
- Python 3.11+
- 16GB RAM (8GB minimum)
- Groq API key (free at console.groq.com)
1. Clone repository
git clone https://github.com/AnkitNayak-eth/EpsteinFiles-RAG.git
cd EpsteinFiles-RAG
2. Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
3. Install dependencies
pip install -r requirements.txt
4. Configure environment
Create .env file:
GROQ_API_KEY=your_api_key_here
This processes data and prepares the system for queries:
# Stage 1: Download raw data (~5-15 min)
python ingest/download_dataset.py
# Stage 2: Clean and reconstruct documents (~3-8 min)
python ingest/clean_dataset.py
# Stage 3: Create semantic chunks (~5-12 min)
python ingest/chunk_dataset.py
# Stage 4: Generate embeddings (~20-45 min)
python ingest/embed_chunks.py
Terminal 1 - Start API Server
uvicorn api.main:app --reload
API runs at: http://127.0.0.1:8000
Terminal 2 - Start Web UI
streamlit run app.py
UI opens at: http://localhost:8501
That's it! You can now query through the web interface or API.
EpsteinFiles-RAG/
├── ingest/ # Data processing pipeline
│ ├── download_dataset.py # Download from Hugging Face
│ ├── clean_dataset.py # Clean & reconstruct docs
│ ├── chunk_dataset.py # Semantic chunking
│ └── embed_chunks.py # Embed & index
├── api/ # FastAPI backend
│ ├── main.py # API routes
│ └── models.py # Data models
├── app.py # Streamlit UI
├── requirements.txt # Python dependencies
├── .env.example # Environment template
└── README.md # This file
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset: Teyler/Epstein Files 20K on Hugging Face
- Embeddings: Sentence Transformers
- Vector DB: Chroma
- LLM Inference: Groq Cloud
- Framework: LangChain
- UI: Streamlit
Built by: Ankit Kumar Nayak
Full-Stack Developer | AI & RAG Systems
This project is built for research, transparency, and educational purposes. All data is sourced from public records. Users are responsible for complying with applicable laws and ethical guidelines when using this system.
