An automated testing pipeline for evaluating RAG (Retrieval-Augmented Generation) system quality using the ragas library and Anthropic (Claude) models.
The system measures RAG performance across four core metrics using a test set (`test_questions.jsonl`) and a knowledge corpus (`corpus.jsonl`).
This pipeline uses ragas to measure four critical metrics:

- **Faithfulness (Answer Fidelity)**
  - Question: How well is the generated answer supported by the retrieved context?
  - Measures: Statements in the answer that are not backed by the context (hallucinations)
- **Answer Relevancy**
  - Question: How relevant is the generated answer to the question that was asked?
  - Measures: Whether the answer drifts from the question or contains unnecessary information
- **Context Precision**
  - Question: How much of the retrieved context was actually necessary to generate the answer?
  - Measures: Noise in the context; a high score indicates the system retrieved only relevant documents
- **Context Recall**
  - Question: Does the retrieved context contain enough information to generate the "ideal" answer (ground truth)?
  - Measures: Whether the system can find the information needed to produce the correct answer
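As a rough intuition for the first metric: ragas uses an LLM to extract individual claims from the answer and judge whether each one is supported by the retrieved context; faithfulness is then the fraction of supported claims. A toy illustration of that ratio (not the ragas implementation itself, and the claims below are invented for the example):

```python
# Toy illustration of the faithfulness idea:
# faithfulness = supported claims / total claims in the answer.
claims = [
    ("Raskolnikov murders a pawnbroker", True),   # backed by the retrieved context
    ("He confesses immediately", False),          # not in the context -> hallucination
]
supported = sum(1 for _, ok in claims if ok)
faithfulness = supported / len(claims)
print(faithfulness)  # 0.5
```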
- Ollama: required for the `nomic-embed-text` embedding model (or change it in `src/config.py`)
- Anthropic API Key: required for Claude models
- Python 3.8+
Install the Ollama embedding model:

```bash
ollama pull nomic-embed-text
```

Clone and set up:
```bash
# Clone the repository
git clone https://github.com/AbdulSametTurkmenoglu/rag_evaluation_pipeline.git
cd rag_evaluation_pipeline

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .\.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY
```

Run the evaluation pipeline:
```bash
python run_evaluation.py
```

When executed, the script will:
- Load API keys and model names from `src/config.py`
- Create or load a LlamaIndex (FAISS) index from the documents in `data/corpus.jsonl` (stored in `storage/`)
- Query each question from `data/test_questions.jsonl` through the RAG pipeline defined in `src/rag_system.py`
- Collect the `answer` and `contexts` returned by the RAG system
- Send the collected data (`question`, `answer`, `contexts`, `ground_truth`) to `ragas` for evaluation
- Print a detailed report to the terminal using `src/reporting.py`
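The collection step above can be sketched as follows. This is a self-contained illustration, not the code in `src/rag_system.py`: `query_rag` is a hypothetical stand-in for the real LlamaIndex/Claude pipeline, and the test case is inlined instead of being read from `data/test_questions.jsonl`.

```python
import json

# Hypothetical stand-in for the pipeline in src/rag_system.py; the real
# version queries a LlamaIndex/FAISS index and generates with Claude.
def query_rag(question):
    answer = "Raskolnikov murders a pawnbroker woman."
    contexts = ["Raskolnikov kills an old pawnbroker with an axe."]
    return answer, contexts

# Each line of data/test_questions.jsonl holds one test case
# (shown inline here so the sketch is self-contained).
test_jsonl = '{"question": "What is Raskolnikov\'s crime?", "ground_truth": "He murders a pawnbroker."}'

records = []
for line in test_jsonl.splitlines():
    case = json.loads(line)
    answer, contexts = query_rag(case["question"])
    records.append({
        "question": case["question"],
        "answer": answer,
        "contexts": contexts,       # list of retrieved passages
        "ground_truth": case["ground_truth"],
    })
# `records` now has the four fields ragas evaluates against.
```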
Example output:

```
====================================================================================================
RAGAS EVALUATION RESULTS
====================================================================================================
QUESTION 1: What is Raskolnikov's crime in Crime and Punishment?
----------------------------------------------------------------------------------------------------
Answer: Raskolnikov's crime is murdering a pawnbroker woman.
Ground Truth: Raskolnikov murders a pawnbroker woman.
Retrieved Documents: 1
METRIC RESULTS (Score 0.0 - 1.0):
  • Faithfulness: 1.0000
  • Answer Relevancy: 0.9850
  • Context Precision: 1.0000
  • Context Recall: 1.0000
====================================================================================================
SUMMARY STATISTICS
====================================================================================================
Faithfulness      : 0.9500
Answer Relevancy  : 0.9925
Context Precision : 1.0000
Context Recall    : 1.0000
====================================================================================================
Evaluation completed!
```
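The summary statistics are per-metric means over all evaluated questions. A minimal sketch of that aggregation, assuming the reporting step averages the per-question scores (the toy numbers below are invented, chosen only so the means match the sample report):

```python
# Per-question metric scores (toy numbers, two questions).
scores = {
    "faithfulness": [1.0, 0.9],
    "answer_relevancy": [0.985, 1.0],
}

# Summary statistics: simple mean of each metric across questions.
summary = {metric: sum(vals) / len(vals) for metric, vals in scores.items()}
print(summary["faithfulness"])  # 0.95
```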
Edit `src/config.py` to customize:

- Embedding model (default: `nomic-embed-text` via Ollama)
- LLM model (default: Claude via Anthropic)
- Evaluation metrics
- Storage paths
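For orientation, a hypothetical sketch of what `src/config.py` might contain; every variable name and the model id here are illustrative assumptions, not the actual contents of the repository:

```python
# Hypothetical sketch of src/config.py -- names are illustrative only;
# the real file in the repository may differ.
import os

EMBEDDING_MODEL = "nomic-embed-text"              # served locally by Ollama
LLM_MODEL = "claude-3-5-sonnet-20241022"          # example Anthropic model id
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
STORAGE_DIR = "storage"                           # where the FAISS index is persisted
```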