A flexible framework for comparing Retrieval-Augmented Generation (RAG) systems side-by-side, with support for subjective quality evaluation using LLMs.
- Multi-tool Support: Compare multiple RAG tools in parallel
- Flexible Adapters: Easy-to-extend adapter pattern for adding new tools
- Multiple Output Formats: Display, JSON, Markdown, and summary formats
- Performance Metrics: Automatic latency measurement and result statistics
- LLM Evaluation: Support for subjective quality assessment using Claude Opus 4.1
- Rich CLI: Beautiful terminal output with tables and panels
- Comprehensive Testing: 90+ tests ensuring reliability
- Python 3.9+
- uv - Fast Python package installer and resolver
To install uv:
# On macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or with Homebrew
brew install uv
# Or with pip
pip install uv
# Clone the repository
git clone https://github.com/ansari-project/ragdiff.git
cd ragdiff
# Install dependencies with uv
uv sync --all-extras # Install all dependencies including dev tools
# Or install only core dependencies
uv sync
# Or install with goodmem support
uv sync --extra goodmem
# Copy environment template
cp .env.example .env
# Edit .env and add your API keys
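The repository ships a `.env.example` template; a filled-in `.env` will look roughly like the sketch below (values are placeholders, and the optional keys are only needed for Goodmem access and LLM evaluation):

```bash
# Required for Mawsuah/Vectara access
VECTARA_API_KEY=your-vectara-api-key
VECTARA_CORPUS_ID=your-corpus-id

# Optional: Goodmem access (mock adapter is used if unset)
GOODMEM_API_KEY=your-goodmem-api-key

# Optional: LLM evaluation with Claude
ANTHROPIC_API_KEY=your-anthropic-api-key
```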
Create a `configs/tools.yaml` file:
tools:
  mawsuah:
    api_key_env: VECTARA_API_KEY
    corpus_id: ${VECTARA_CORPUS_ID}
    base_url: https://api.vectara.io
    timeout: 30
  goodmem:
    api_key_env: GOODMEM_API_KEY
    base_url: https://api.goodmem.ai
    timeout: 30

llm:
  model: claude-opus-4-1-20250805
  api_key_env: ANTHROPIC_API_KEY
# Compare all configured tools
uv run python -m src.cli compare "What is Islamic inheritance law?"
# Compare specific tools
uv run python -m src.cli compare "Your query" --tool mawsuah --tool goodmem
# Adjust number of results
uv run python -m src.cli compare "Your query" --top-k 10
# Default display format (side-by-side)
uv run python -m src.cli compare "Your query"
# JSON output
uv run python -m src.cli compare "Your query" --format json
# Markdown output
uv run python -m src.cli compare "Your query" --format markdown
# Summary output
uv run python -m src.cli compare "Your query" --format summary
# Save to file
uv run python -m src.cli compare "Your query" --output results.json --format json
Run multiple queries and get comprehensive analysis:
# Basic batch comparison
uv run python -m src.cli batch inputs/tafsir-test-queries.txt \
--config configs/tafsir.yaml \
--top-k 10 \
--format json
# With LLM evaluation (generates holistic summary)
uv run python -m src.cli batch inputs/tafsir-test-queries.txt \
--config configs/tafsir.yaml \
--evaluate \
--top-k 10 \
--format json
# Custom output directory
uv run python -m src.cli batch inputs/tafsir-test-queries.txt \
--config configs/tafsir.yaml \
--evaluate \
--output-dir my-results \
--format jsonl
The batch command with `--evaluate` generates:
- Individual query results in JSON/JSONL/CSV format
- Latency statistics (P50, P95, P99)
- LLM evaluation summary showing wins and quality scores
- Holistic summary (markdown file) with:
  - Query-by-query breakdown with winners and scores
  - Common themes: win distribution, recurring issues
  - Key differentiators: what makes the winner better versus the loser's weaknesses
  - Overall verdict with a production recommendation
Convert holistic summary to PDF:
# Generate PDF from markdown summary
python md2pdf.py outputs/holistic_summary_TIMESTAMP.md
# List available tools
uv run python -m src.cli list-tools
# Validate configuration
uv run python -m src.cli validate-config
# Run quick test
uv run python -m src.cli quick-test
# Get help
uv run python -m src.cli --help
uv run python -m src.cli compare --help
uv run python -m src.cli batch --help
ragdiff/
├── src/
│ ├── core/ # Core models and configuration
│ │ ├── models.py # Data models (RagResult, ComparisonResult, etc.)
│ │ └── config.py # Configuration management
│ ├── adapters/ # Tool adapters
│ │ ├── base.py # Base adapter implementing SearchVectara interface
│ │ ├── mawsuah.py # Vectara/Mawsuah adapter
│ │ ├── goodmem.py # Goodmem adapter with mock fallback
│ │ └── factory.py # Adapter factory
│ ├── comparison/ # Comparison engine
│ │ └── engine.py # Parallel/sequential search execution
│ ├── display/ # Display formatters
│ │ └── formatter.py # Multiple output format support
│ └── cli.py # Typer CLI implementation
├── tests/ # Comprehensive test suite
├── configs/ # Configuration files
└── requirements.txt # Python dependencies
The tool follows the SPIDER protocol for systematic development:
- Specification: Clear goals for subjective RAG comparison
- Planning: Phased implementation approach
- Implementation: Clean architecture with separation of concerns
- Defense: Comprehensive test coverage (90+ tests)
- Evaluation: Expert review and validation
- Commit: Version control with clear history
- BaseRagTool: Abstract base implementing SearchVectara interface
- Adapters: Tool-specific implementations (Mawsuah, Goodmem)
- ComparisonEngine: Orchestrates parallel/sequential searches
- ComparisonFormatter: Handles multiple output formats
- Config: Manages YAML configuration with environment variables
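For programmatic use outside the CLI, these components can be composed directly. The following is a minimal sketch only: the constructor and method names (`Config.load`, `ComparisonEngine.compare`, `ComparisonFormatter.format`) are illustrative assumptions, not verified signatures; consult `src/core/config.py`, `src/comparison/engine.py`, and `src/display/formatter.py` for the actual API.

```python
# Illustrative sketch: method names below are assumptions, not verified signatures.
from src.core.config import Config
from src.adapters.factory import ADAPTER_REGISTRY
from src.comparison.engine import ComparisonEngine
from src.display.formatter import ComparisonFormatter

config = Config.load("configs/tools.yaml")               # parse YAML + resolve env vars
adapters = [ADAPTER_REGISTRY[name](config.tools[name])   # one adapter per configured tool
            for name in ("mawsuah", "goodmem")]

engine = ComparisonEngine(adapters)                       # orchestrates parallel searches
result = engine.compare("What is Islamic inheritance law?", top_k=5)

formatter = ComparisonFormatter()
print(formatter.format(result, format="markdown"))        # or "json", "summary", "display"
```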
- Create a new adapter in `src/adapters/`:
from typing import List

from .base import BaseRagTool
from ..core.models import RagResult


class MyToolAdapter(BaseRagTool):
    def search(self, query: str, top_k: int = 5) -> List[RagResult]:
        # Implement tool-specific search and map raw hits to RagResult objects
        results = self.client.search(query, limit=top_k)
        return [self._convert_to_rag_result(r) for r in results]
- Register in `src/adapters/factory.py`:
ADAPTER_REGISTRY["mytool"] = MyToolAdapter
- Add configuration in `configs/tools.yaml`:
tools:
  mytool:
    api_key_env: MYTOOL_API_KEY
    base_url: https://api.mytool.com
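Once the adapter is registered and configured, you can exercise it the same way as the built-in tools, for example:

```bash
# Confirm the new tool is registered, then run a comparison against it
uv run python -m src.cli list-tools
uv run python -m src.cli compare "test query" --tool mytool
```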
# Run all tests
uv run pytest tests/
# Run specific test file
uv run pytest tests/test_cli.py
# Run with coverage
uv run pytest tests/ --cov=src
The project uses:
- Black for formatting
- Ruff for linting
- MyPy for type checking
# Format code with Black
uv run black src/ tests/
# Check linting with Ruff
uv run ruff check src/ tests/
# Type checking with MyPy
uv run mypy src/
Required environment variables:
- `VECTARA_API_KEY`: For Mawsuah/Vectara access
- `VECTARA_CORPUS_ID`: Vectara corpus ID
- `GOODMEM_API_KEY`: For Goodmem access (optional, uses mock if not set)
- `ANTHROPIC_API_KEY`: For LLM evaluation (optional)
[Your License]
Contributions welcome! Please follow the existing code style and add tests for new features.
Built following the SPIDER protocol for systematic development.