An automated testing pipeline for evaluating RAG (Retrieval-Augmented Generation) system quality using the ragas library and Anthropic (Claude) models.
The system measures RAG performance across four core metrics using a test set (`test_questions.jsonl`) and a knowledge corpus (`corpus.jsonl`).
This pipeline uses ragas to measure four critical metrics:

- **Faithfulness (Answer Fidelity)**
  - Question: How well is the generated answer supported by the retrieved context?
  - Measures: Statements in the answer that are not backed by the context (hallucinations)
- **Answer Relevancy**
  - Question: How relevant is the generated answer to the question that was asked?
  - Measures: Whether the answer drifts from the question or contains unnecessary information
- **Context Precision**
  - Question: How much of the retrieved context was actually necessary to generate the answer?
  - Measures: Noise in the context; a high score indicates the system retrieved only relevant documents
- **Context Recall**
  - Question: Does the retrieved context contain enough information to generate the "ideal" answer (ground truth)?
  - Measures: Whether the system can find the information needed to produce the correct answer
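As a rough intuition for the first metric: ragas uses an LLM to extract individual claims from the answer and judge whether each one is supported by the retrieved context; faithfulness is then the fraction of supported claims. A toy illustration of that ratio (not the ragas implementation itself, and the claims below are invented for the example):

```python
# Toy illustration of the faithfulness idea:
# faithfulness = supported claims / total claims in the answer.
claims = [
    ("Raskolnikov murders a pawnbroker", True),   # backed by the retrieved context
    ("He confesses immediately", False),          # not in the context -> hallucination
]
supported = sum(1 for _, ok in claims if ok)
faithfulness = supported / len(claims)
print(faithfulness)  # 0.5
```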
- Ollama: required for the `nomic-embed-text` embedding model (or change it in `src/config.py`)
- Anthropic API Key: required for Claude models
- Python 3.8+
Install the Ollama embedding model:

```bash
ollama pull nomic-embed-text
```

Clone and set up:
```bash
# Clone the repository
git clone https://github.com/AbdulSametTurkmenoglu/rag_evaluation_pipeline.git
cd rag_evaluation_pipeline

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .\.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY
```

Run the evaluation pipeline:
```bash
python run_evaluation.py
```

When executed, the script will:
- Load API keys and model names from `src/config.py`
- Create or load a LlamaIndex (FAISS) index from the documents in `data/corpus.jsonl` (stored in `storage/`)
- Query each question from `data/test_questions.jsonl` through the RAG pipeline defined in `src/rag_system.py`
- Collect the `answer` and `contexts` returned by the RAG system
- Send the collected data (`question`, `answer`, `contexts`, `ground_truth`) to `ragas` for evaluation
- Print a detailed report to the terminal using `src/reporting.py`
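The collection step above can be sketched as follows. This is a self-contained illustration, not the code in `src/rag_system.py`: `query_rag` is a hypothetical stand-in for the real LlamaIndex/Claude pipeline, and the test case is inlined instead of being read from `data/test_questions.jsonl`.

```python
import json

# Hypothetical stand-in for the pipeline in src/rag_system.py; the real
# version queries a LlamaIndex/FAISS index and generates with Claude.
def query_rag(question):
    answer = "Raskolnikov murders a pawnbroker woman."
    contexts = ["Raskolnikov kills an old pawnbroker with an axe."]
    return answer, contexts

# Each line of data/test_questions.jsonl holds one test case
# (shown inline here so the sketch is self-contained).
test_jsonl = '{"question": "What is Raskolnikov\'s crime?", "ground_truth": "He murders a pawnbroker."}'

records = []
for line in test_jsonl.splitlines():
    case = json.loads(line)
    answer, contexts = query_rag(case["question"])
    records.append({
        "question": case["question"],
        "answer": answer,
        "contexts": contexts,       # list of retrieved passages
        "ground_truth": case["ground_truth"],
    })
# `records` now has the four fields ragas evaluates against.
```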
Example output:

```
====================================================================================================
RAGAS EVALUATION RESULTS
====================================================================================================
QUESTION 1: What is Raskolnikov's crime in Crime and Punishment?
----------------------------------------------------------------------------------------------------
Answer: Raskolnikov's crime is murdering a pawnbroker woman.
Ground Truth: Raskolnikov murders a pawnbroker woman.
Retrieved Documents: 1
METRIC RESULTS (Score 0.0 - 1.0):
  • Faithfulness: 1.0000
  • Answer Relevancy: 0.9850
  • Context Precision: 1.0000
  • Context Recall: 1.0000
====================================================================================================
SUMMARY STATISTICS
====================================================================================================
Faithfulness      : 0.9500
Answer Relevancy  : 0.9925
Context Precision : 1.0000
Context Recall    : 1.0000
====================================================================================================
Evaluation completed!
```
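The summary statistics are per-metric means over all evaluated questions. A minimal sketch of that aggregation, assuming the reporting step averages the per-question scores (the toy numbers below are invented, chosen only so the means match the sample report):

```python
# Per-question metric scores (toy numbers, two questions).
scores = {
    "faithfulness": [1.0, 0.9],
    "answer_relevancy": [0.985, 1.0],
}

# Summary statistics: simple mean of each metric across questions.
summary = {metric: sum(vals) / len(vals) for metric, vals in scores.items()}
print(summary["faithfulness"])  # 0.95
```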
Edit `src/config.py` to customize:

- Embedding model (default: `nomic-embed-text` via Ollama)
- LLM model (default: Claude via Anthropic)
- Evaluation metrics
- Storage paths
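For orientation, a hypothetical sketch of what `src/config.py` might contain; every variable name and the model id here are illustrative assumptions, not the actual contents of the repository:

```python
# Hypothetical sketch of src/config.py -- names are illustrative only;
# the real file in the repository may differ.
import os

EMBEDDING_MODEL = "nomic-embed-text"              # served locally by Ollama
LLM_MODEL = "claude-3-5-sonnet-20241022"          # example Anthropic model id
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
STORAGE_DIR = "storage"                           # where the FAISS index is persisted
```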