DocuChat provides a complete, ready-to-use Retrieval-Augmented Generation (RAG) system for building a question-answering chatbot over your own documents. It features a clean, asynchronous Python architecture, a choice of a command-line or web interface, and an efficient data ingestion pipeline.
It is built with a modern, production-ready stack, making it a solid foundation for more complex AI applications.
- RAG Pipeline: Implements a robust Retrieve-then-Generate pipeline using `LangChain` and `LangGraph` for state management.
- Pluggable LLMs: Easily configurable to use different Large Language Models. It is pre-configured for Google's Gemini family.
- High-Quality Embeddings: Utilizes Hugging Face `sentence-transformers` for generating document embeddings, with automatic device detection (CUDA/CPU).
- Persistent Vector Storage: Leverages PostgreSQL with the `pgvector` extension for scalable and persistent vector storage, managed via `SQLAlchemy` and `LangChain`'s integrations.
- Efficient Data Ingestion: An idempotent script processes and indexes your documents, using content hashing (SHA-1) to prevent re-indexing duplicates, saving time and compute.
- Dual Interfaces:
  - An interactive Command-Line Interface (CLI) for quick tests and local use.
  - A user-friendly web interface built with `Gradio`.
- Containerized Database: Comes with a `docker-compose` setup for spinning up the PostgreSQL/pgvector database in a single command.
- Modern Python Stack: Built with Python 3.12, `asyncio` for concurrent operations, and `Pydantic` for robust, type-safe configuration management.
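Stripped of the LangGraph and pgvector machinery, the retrieve-then-generate flow amounts to: rank stored chunks by similarity to the query embedding, then build a grounded prompt for the LLM. A toy sketch (the in-memory store, 2-d "embeddings", and the `llm` lambda are stand-ins, not the project's actual components):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, store, k=2):
    """Return the k chunks whose embeddings are closest to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

def generate(question, chunks, llm):
    """Build a grounded prompt from the retrieved chunks and ask the LLM."""
    context = "\n".join(chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)

# Toy store with pre-computed 2-d "embeddings".
store = [
    {"text": "DocuChat uses pgvector.", "vec": [1.0, 0.0]},
    {"text": "Bananas are yellow.", "vec": [0.0, 1.0]},
]
chunks = retrieve([0.9, 0.1], store, k=1)
answer = generate("What vector store is used?", chunks, llm=lambda prompt: chunks[0])
```

In the real pipeline, `retrieve` is a pgvector similarity search and `generate` is a Gemini call; the control flow is the same.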
- Orchestration: LangChain, LangGraph
- LLM: Google Gemini (via `langchain-google-genai`)
- Embeddings: Hugging Face Sentence Transformers
- Vector Database: PostgreSQL + pgvector
- Web UI: Gradio
- Infrastructure: Docker
- Configuration: Pydantic
Follow these steps to get the RAG system running on your local machine.
- Git
- Docker and Docker Compose
- Python 3.12+
```
git clone https://github.com/hnthap/docu-chat.git
cd docu-chat
```
The project uses `.env` files for configuration. You'll need to create two.

First, create a file for the database credentials at `pgvector/.env`:
```
# pgvector/.env
POSTGRES_USER=langchain
POSTGRES_PASSWORD=langchain
POSTGRES_DB=langchain
```
Next, create the main configuration file `.env` in the project root. This is where you'll add your API key.
```
# .env
GOOGLE_API_KEY="your-google-gemini-api-key"

# These should match the values in pgvector/.env
POSTGRES_HOST=localhost
POSTGRES_PORT=6024

# Vector store settings
TABLE_NAME=vectorstore
VECTOR_SIZE=768  # Corresponds to the default embedding model 'all-mpnet-base-v2'
```
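At runtime, the Postgres variables above are combined into a database connection string. A minimal sketch of that assembly (`build_dsn` is a hypothetical helper; the project's `config.py` uses Pydantic and may compose this differently):

```python
def build_dsn(env):
    """Assemble a psycopg-style connection string from the env vars above.
    `env` is any mapping of variable names to values (e.g. os.environ)."""
    return (
        "postgresql+psycopg://"
        f"{env['POSTGRES_USER']}:{env['POSTGRES_PASSWORD']}"
        f"@{env['POSTGRES_HOST']}:{env['POSTGRES_PORT']}"
        f"/{env['POSTGRES_DB']}"
    )

# Values matching the two .env files above.
example = {
    "POSTGRES_USER": "langchain",
    "POSTGRES_PASSWORD": "langchain",
    "POSTGRES_HOST": "localhost",
    "POSTGRES_PORT": "6024",
    "POSTGRES_DB": "langchain",
}
dsn = build_dsn(example)
```

If the credentials in `pgvector/.env` and the host/port in the root `.env` disagree, this is the string that will fail to connect, so keep them in sync.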
It is highly recommended to use a virtual environment:
```
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```
You can also use `conda`. Using a virtual environment helps keep the install reproducible and prevents conflicts with packages already on your system.
Then, install the rest of the project dependencies:
```
pip install -r requirements.txt
```
Run the following command from the project root to start the PostgreSQL database in a Docker container.
```
docker-compose up -d
```
The database will be available on `localhost:6024`.
Place your text files (`.txt`) inside the `data/` directory (you may need to create it). The ingestion script will find them recursively.
```
├── data/
│   ├── project_a_report.txt
│   └── financials/
│       └── q4_earnings.txt
└── ...
```
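The recursive discovery step can be sketched with `pathlib` (a hypothetical `find_documents` helper, demonstrated on a throwaway directory mirroring the tree above):

```python
import tempfile
from pathlib import Path

def find_documents(root):
    """Recursively collect all .txt files under the data directory."""
    return sorted(p for p in Path(root).rglob("*.txt"))

# Demonstrate on a temporary directory shaped like the example tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "financials").mkdir()
    (Path(tmp) / "project_a_report.txt").write_text("...")
    (Path(tmp) / "financials" / "q4_earnings.txt").write_text("...")
    names = [p.name for p in find_documents(tmp)]
```

Files at any nesting depth under `data/` are picked up, so you can organize documents into subdirectories freely.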
Note that you can use this with financial reports, encyclopedias, project descriptions, or any other textual content. There are no technical limitations, but you must meet your obligations regarding copyright and licensing. A public-domain corpus such as the Catholic Encyclopedia (1912) works well as an example.
Run the ingestion script. This will read your documents, split them into chunks, generate embeddings, and store them in the vector database.
```
python build_index.py
```
You will see progress bars as the documents are processed. Thanks to content hashing, you can run this script multiple times, and it will only add new, unseen document chunks.
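The deduplication idea is simple: fingerprint each chunk with SHA-1 and skip any fingerprint already in the store. A minimal sketch (function names are illustrative; `build_index.py` checks hashes against the database rather than an in-memory set):

```python
import hashlib

def chunk_hash(text: str) -> str:
    """Stable SHA-1 fingerprint of a chunk's content."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def index_new_chunks(chunks, seen_hashes):
    """Return only chunks whose hash is not already recorded,
    adding their hashes so a later run skips them."""
    fresh = []
    for chunk in chunks:
        h = chunk_hash(chunk)
        if h not in seen_hashes:
            seen_hashes.add(h)
            fresh.append(chunk)
    return fresh

seen = set()
first_run = index_new_chunks(["alpha", "beta"], seen)           # both are new
second_run = index_new_chunks(["alpha", "beta", "gamma"], seen)  # only "gamma" is new
```

Because hashing is over content rather than filenames, renaming or moving a file does not cause its chunks to be embedded again.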
You can interact with the RAG system via the CLI or the web UI.
To start the command-line interface:
```
python rag.py
```
You can then start asking questions. Type `\exit` to quit.
To launch the web application:
```
python rag.py serve
```
Open your browser and navigate to the local URL provided (usually `http://127.0.0.1:7860`).
A brief overview of the key files in this project:
```
.
├── config.py           # Pydantic-based settings management.
├── docker-compose.yml  # Docker configuration for the pgvector database.
├── helper.py           # Utility functions for logging and I/O.
├── build_index.py      # Script for processing and indexing documents.
├── rag.py              # Main application logic: the RAG class, LangGraph definition, and interfaces (CLI, Gradio).
├── requirements.txt    # Python package dependencies.
└── data/               # (Create this) Directory for your source documents.
```
- Testing: Implement a full test suite with `pytest` (unit and integration tests) to ensure reliability.
- Observability: Integrate structured logging (e.g., JSON logs) and add observability hooks for monitoring in production.
- Error Handling: Introduce more specific exception classes and retry mechanisms (e.g., with `tenacity`) for network-dependent operations.
- CI/CD: Set up a GitHub Actions workflow for automated testing, linting, and building.
- Containerize Application: Dockerize the main Python application for consistent deployments.
- Session Management: Implement session-scoped conversation history for a multi-user web deployment.
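As an interim measure before adopting `tenacity`, a retry wrapper for network-dependent calls can be sketched with the stdlib alone (names are illustrative, and a real version would add jittered exponential backoff and catch only transient error types):

```python
import time

def retry(fn, attempts=3, delay=0.01):
    """Call fn, retrying on any exception up to `attempts` times
    with a fixed delay between tries; re-raise on the last failure."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# A stub "network call" that fails twice, then succeeds on the third try.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = retry(flaky)
```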