A Retrieval-Augmented Generation (RAG) system that enables intelligent question-answering over PDF documents using local LLMs and embeddings.
- PDF Document Ingestion: Automatically extract and process text from PDF files
- Intelligent Text Chunking: Split documents into optimal chunks for processing
- Vector Embeddings: Generate semantic embeddings using Ollama models
- Similarity Search: Find relevant document sections based on query similarity
- Question Answering: Generate contextual answers using retrieved information
- Local Processing: Runs entirely on your infrastructure using Ollama
- Python 3.8 or higher
- Ollama installed and running
- At least 8GB RAM (16GB recommended for larger documents)
```bash
# Clone the repository
git clone <your-repo-url>
cd pdf-rag-pipeline

# Install required packages
pip install -r requirements.txt
```

```bash
# Pull required models
ollama pull llama3.2
ollama pull nomic-embed-text

# Start Ollama server (if not already running)
ollama serve
```

Place your PDF file in the `./data/` directory or update the `doc_pwd` variable in the script.
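Before the first run, you can optionally confirm that the Ollama server is reachable and the models are pulled. A minimal sketch using the `ollama` Python client from requirements.txt (this check is not part of main.py):

```python
import ollama

# Raises a connection error if no Ollama server is listening on the default
# port (11434); otherwise prints the locally available models.
print(ollama.list())
```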
```bash
python main.py
```

The RAG pipeline follows these steps:
```python
loader = UnstructuredPDFLoader(file_path=doc_pwd)
data = loader.load()
```

- Loads PDF documents using LangChain's document loaders
- Extracts text content while preserving document structure
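The imports are not shown above; assuming the dependencies in requirements.txt, the loader comes from `langchain_community`, and the result can be inspected like this (illustrative sketch, not the exact contents of main.py):

```python
from langchain_community.document_loaders import UnstructuredPDFLoader

doc_pwd = "./data/your_document.pdf"  # assumed path, matching the project structure below

loader = UnstructuredPDFLoader(file_path=doc_pwd)
data = loader.load()

# Each element is a LangChain Document with .page_content and .metadata
print(len(data), data[0].metadata)
```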
```python
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=300)
chunks = text_splitter.split_documents(data)
```

- Splits large documents into manageable chunks
- Maintains context overlap between chunks for better retrieval
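A quick way to sanity-check the chunking settings (the import path assumes the `langchain_text_splitters` package from requirements.txt; the inspection line is illustrative):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=300)
chunks = text_splitter.split_documents(data)

# Confirm the chunk count and that no chunk drastically exceeds the configured size
print(f"{len(chunks)} chunks, longest = {max(len(c.page_content) for c in chunks)} characters")
```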
```python
embeddings = OllamaEmbeddings(model="nomic-embed-text")
```

- Converts text chunks into vector embeddings
- Uses Ollama's `nomic-embed-text` model for semantic representation
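To confirm the embedding model responds before indexing a whole document, embed a test string (assuming `OllamaEmbeddings` comes from `langchain_ollama`, as listed in requirements.txt):

```python
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# nomic-embed-text returns a fixed-length vector (768 dimensions)
vector = embeddings.embed_query("a quick embedding smoke test")
print(len(vector))
```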
```python
vector_store = Chroma.from_documents(documents=chunks, embedding=embeddings)
```

- Saves embeddings in a Chroma vector database
- Enables fast similarity search capabilities
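A note on imports and persistence (an assumption based on the dependency list below, not something stated in main.py): `Chroma` is typically imported from `langchain_community.vectorstores`, and without a `persist_directory` the collection lives only in memory, so the index is rebuilt on every run.

```python
from langchain_community.vectorstores import Chroma

# In-memory index, rebuilt each run; see the performance section below for the
# persist_directory variant that keeps the index on disk in ./chroma_db.
vector_store = Chroma.from_documents(documents=chunks, embedding=embeddings)
```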
```python
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
```

- Finds the most relevant document chunks for user queries
- Returns the top-k similar documents based on embedding similarity
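With recent LangChain versions, retrievers are Runnables, so this step can be exercised on its own before composing the full chain (an illustrative check, not code from main.py):

```python
# Pull the four most similar chunks for a sample question and preview them
docs = retriever.invoke("What are the key findings?")
for i, doc in enumerate(docs, start=1):
    print(f"[{i}] {doc.page_content[:120]}...")
```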
```python
rag_pipeline = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```

- Combines retrieved context with user questions
- Uses the LLM to generate comprehensive answers
```
pdf-rag-pipeline/
├── main.py              # Main RAG pipeline script
├── requirements.txt     # Python dependencies
├── data/                # PDF documents directory
│   └── your_document.pdf
├── chroma_db/           # Vector database storage (auto-created)
└── README.md            # This file
```
```
ollama
chromadb
pdfplumber
langchain
langchain-core
langchain-ollama
langchain-community
langchain_text_splitters
unstructured[pdf]
fastembed
pikepdf
elevenlabs
PyMuPDF
```

- PDF Path: Update the `doc_pwd` variable for your document location
- Chunk Size: Adjust `chunk_size` (default: 1200) for different document types
- Chunk Overlap: Modify `chunk_overlap` (default: 300) to maintain context
- LLM Model: Change the `model` variable (default: "llama3.2")
- Embedding Model: Modify the embedding model in `OllamaEmbeddings`
- Remote Ollama: Update the `remote_ollama` URL if using a remote instance
- Search Results: Adjust the `k` parameter in the retriever (default: 4)
- Search Type: Choose between "similarity", "mmr", or "similarity_score_threshold"
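For example, switching the retriever to MMR only changes the `as_retriever` call; the `fetch_k` value below is an illustrative choice, not a setting taken from main.py:

```python
# Maximal Marginal Relevance: fetch a larger candidate pool, then keep
# 4 chunks that are relevant to the query but not redundant with each other
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20},
)
```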
```python
response = rag_pipeline.invoke("What is the main topic of this document?")
print(response)
```

```python
queries = [
    "What are the key findings?",
    "What recommendations are made?",
    "Who are the main stakeholders?"
]

for query in queries:
    response = rag_pipeline.invoke(query)
    print(f"Q: {query}")
    print(f"A: {response}\n")
```

- Use `PyMuPDFLoader` instead of `UnstructuredPDFLoader` for faster PDF processing
- Reduce chunk size to 800-1000 characters
- Use `llama3.2:1b` for faster inference
- Enable GPU acceleration in Ollama
- Process documents in batches for large collections
- Use persistent vector storage to avoid re-processing (see the sketch after this list)
- Clear unused variables after processing
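Reusing a persisted index on later runs skips re-loading, re-chunking, and re-embedding the PDF entirely. A minimal sketch, assuming the `./chroma_db` directory created by the persistent-storage example below and the import paths from requirements.txt:

```python
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Re-open the existing on-disk collection instead of calling from_documents again
vector_store = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
```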
```python
# Faster PDF loading
loader = PyMuPDFLoader(file_path=doc_pwd)

# Optimized chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=200
)

# Persistent storage
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
```

| Operation | Time (Small PDF) | Time (Large PDF) |
|---|---|---|
| PDF Loading | 1-3 seconds | 5-15 seconds |
| Text Chunking | <1 second | 1-5 seconds |
| Embedding Creation | 10-30 seconds | 1-5 minutes |
| Query Processing | 15-30 seconds | 20-45 seconds |
1. Ollama Connection Error

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Restart Ollama if needed
ollama serve
```

2. Memory Issues
- Reduce chunk size and batch size
- Use smaller models (llama3.2:1b)
- Process documents individually
3. Slow Performance
- Run Ollama locally instead of on a remote server
- Use SSD storage for the vector database
- Enable GPU acceleration
4. PDF Loading Errors
- Install additional dependencies: `pip install unstructured[all-docs]`
- Try different PDF loaders (PyMuPDF, PDFPlumber), as shown below
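Both alternative loaders ship with `langchain_community`; which one copes best with a given PDF is worth testing. A sketch using the same `doc_pwd` variable as main.py:

```python
from langchain_community.document_loaders import PDFPlumberLoader, PyMuPDFLoader

# Drop-in alternatives to UnstructuredPDFLoader; both return LangChain Documents
loader = PyMuPDFLoader(file_path=doc_pwd)       # PyMuPDF backend, typically fast
# loader = PDFPlumberLoader(file_path=doc_pwd)  # pdfplumber backend
data = loader.load()
```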
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Ollama for local LLM inference
- LangChain for RAG pipeline framework
- Chroma for vector database
- Unstructured for PDF processing