A FastAPI-based application for extracting text from PDFs, generating embeddings, and performing hybrid search with RAG.
- Extract text from PDFs using PyPDF2
- Generate embeddings with SentenceTransformers
- Store embeddings in PostgreSQL with pgvector
- Index text in Elasticsearch for keyword search
- Hybrid search combining keyword and semantic results
- RAG pipeline with LangChain for contextual responses
- Cache results with Redis for performance
Requirements:
- Python 3.10+
- PostgreSQL with pgvector extension
- Elasticsearch
- Redis
- OpenAI API key
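
The result caching mentioned in the feature list needs a stable key per search request. The sketch below shows one way such a key could be derived. `make_cache_key`, the `search` prefix, and the normalization rule are hypothetical and not taken from this project's code.

```python
import hashlib
import json


def make_cache_key(query: str, top_k: int, prefix: str = "search") -> str:
    """Build a deterministic cache key for a search request (hypothetical helper)."""
    # Normalize case and whitespace so trivially different spellings
    # of the same query share one cache entry.
    normalized = " ".join(query.lower().split())
    # sort_keys makes the serialized payload independent of dict ordering.
    payload = json.dumps({"query": normalized, "top_k": top_k}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"
```

With a Redis client, a key like this would be looked up before running the search and written back with an expiry afterwards. Including `top_k` in the key keeps requests for different result counts from sharing an entry.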
- Clone the repository:
    git clone <repository-url>
    cd hybrid_search
- Install dependencies using Poetry:
    poetry install
- Set up environment variables:
    export OPENAI_API_KEY='your-api-key'
- Ensure PostgreSQL, Elasticsearch, and Redis are running.
- Initialize the database:
    poetry run python -m database.db_init
- Start the FastAPI server:
    poetry run uvicorn main:app --reload
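
Before initializing the database, it can help to confirm that the three backing services are reachable. The following is a minimal sketch using only the standard library. It assumes each service runs on `localhost` with its default port (5432, 9200, 6379), and the `SERVICES` mapping and function names are illustrative rather than part of this project.

```python
import socket

# Assumption: services run locally on their default ports.
# Adjust hosts and ports to match your deployment.
SERVICES = {
    "PostgreSQL": ("localhost", 5432),
    "Elasticsearch": ("localhost", 9200),
    "Redis": ("localhost", 6379),
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def check_services(services=SERVICES) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {
        name: is_port_open(host, port)
        for name, (host, port) in services.items()
    }
```

Calling `check_services()` returns a dictionary of service names to booleans. An open port only shows that something is listening, not that the service is correctly configured (for example, that pgvector is installed).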
- Upload PDF: POST /upload_pdf/ with a PDF file to extract text and store embeddings.
- Search: POST /search/ with a JSON payload { "query": "your query", "top_k": 5 } to perform hybrid search.
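
As an illustration of the search endpoint, the sketch below builds and sends the JSON payload with the standard library. It assumes the server is listening on uvicorn's default address, `http://127.0.0.1:8000`, and that the response body is JSON. The response schema is not documented here, so the decoded value is returned unchanged. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the server runs on uvicorn's default host and port.
BASE_URL = "http://127.0.0.1:8000"


def build_search_request(
    query: str, top_k: int = 5, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /search/ endpoint."""
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, top_k: int = 5) -> object:
    """Send the request and decode the response, assumed to be JSON."""
    with urllib.request.urlopen(build_search_request(query, top_k)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating request construction from sending makes the payload easy to inspect without a running server.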
pdf-search-api/
├── database/
│ └── db_init.py
├── services/
│ ├── pdf_service.py
│ ├── embedding_service.py
│ └── search_service.py
├── routers/
│ ├── pdf_router.py
│ └── search_router.py
├── main.py
├── pyproject.toml
└── README.md
Run tests with:
    poetry run pytest
- Ensure PostgreSQL has the pgvector extension installed.
- Configure Elasticsearch and Redis connection settings as needed.
- Adjust top_k in search queries for the desired result count.
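
This README does not say how keyword and semantic results are merged or where `top_k` is applied. One common technique is reciprocal rank fusion (RRF), sketched below as an illustration and not as this project's actual method. The function name, the `k=60` constant, and the use of string document IDs are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(keyword_ids, semantic_ids, top_k=5, k=60):
    """Fuse two ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in (keyword_ids, semantic_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score, highest first; ties fall back to the document ID
    # so the output is deterministic (IDs must be mutually comparable).
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]
```

For example, `reciprocal_rank_fusion(["a", "b", "c"], ["b", "d", "a"], top_k=2)` returns `["b", "a"]`, because both documents appear in both lists and `b` holds the better pair of ranks. RRF uses only rank positions, so the keyword and vector similarity scores do not need to be on a comparable scale.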