# AGENTS.md - AI Assistant Guide for kaggle-benchmarks

## Project Context

`kaggle-benchmarks` is a Python library for rigorously evaluating LLMs on custom tasks using decorators, assertions, and tool-augmented interactions. **Tech Stack:** Python 3.11+, uv (package manager), pytest, ruff, mypy, Protocol Buffers.

## High-Level Architecture

This is a **library-first codebase** organized by functional concerns:

- **`src/kaggle_benchmarks/`** - Core library implementing the benchmark framework
  - Top-level modules define primitives: tasks, assertions, clients, messages, results, runs
  - Subdirectories provide specialized subsystems: `actors/` (LLM interaction), `tools/` (Python interpreter, web search), `envs/` (execution environments), `kaggle/` (platform integration), `ui/` (Panel-based interfaces)
- **`tests/`** - Pytest test suite mirroring `src/` structure
- **`research_benchmarks/`** - Reference implementations of academic benchmarks (MathVista, SimpleQA)
- **`documentation/`** - Quarto-based docs with executable examples in `examples/`
- **`protos/`** - Protocol Buffer schemas for serialization
- **`cicd/`** - Docker and CI/CD scripts

**Mental Model:** Users write decorated functions (`@kbench.task`) that prompt LLMs and assert outputs. The library handles orchestration, caching, serialization, and UI rendering.

## Sources of Truth (The Map)

### Configuration & Environment
- **Environment variables** → Read `.env` file format in `README.md`
- **Execution modes** → `src/kaggle_benchmarks/_config.py` defines `ExecutionMode` enum and `Config` dataclass
- **Package metadata** → `pyproject.toml` (dependencies, version, tool configs)
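
A hedged sketch of the `ExecutionMode`/`Config` pattern named above. The member names mirror the execution environments listed under `envs/` (local, docker), but the exact members and fields are assumptions; check `src/kaggle_benchmarks/_config.py` for the real definitions:

```python
import enum
from dataclasses import dataclass

class ExecutionMode(enum.Enum):
    # Member names assumed from the envs/ subsystem (local, docker)
    LOCAL = "local"
    DOCKER = "docker"

@dataclass
class Config:
    # Field names are hypothetical, for illustration only
    mode: ExecutionMode = ExecutionMode.LOCAL
    cache_dir: str = ".kbench_cache"

cfg = Config(mode=ExecutionMode.DOCKER)
```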

### Core API Surface
- **Public exports** → `src/kaggle_benchmarks/__init__.py` defines what users import
- **Task/benchmark decorators** → `src/kaggle_benchmarks/tasks.py` (`@task`, `@benchmark`)
- **Assertions** → `src/kaggle_benchmarks/assertions.py` (all `assert_*` functions)
- **LLM clients** → `src/kaggle_benchmarks/clients.py` (client abstraction and resolution)
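
The `assert_*` convention can be illustrated with a self-contained helper. `assert_contains` is a hypothetical name chosen for this sketch and is not necessarily present in `assertions.py`; the pattern shown (raise `AssertionError` with a descriptive message) is the convention being named:

```python
def assert_contains(haystack: str, needle: str) -> None:
    """Fail with a descriptive message when `needle` is absent from model output."""
    if needle not in haystack:
        raise AssertionError(f"expected {needle!r} in model output {haystack!r}")

assert_contains("The answer is 42.", "42")  # passes silently
```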

### Subsystems
- **Actor system** → `src/kaggle_benchmarks/actors/` (LLMChat, Actor base classes)
- **Tools** → `src/kaggle_benchmarks/tools/` (Python interpreter, web search)
- **Execution environments** → `src/kaggle_benchmarks/envs/` (local, docker)
- **Kaggle integration** → `src/kaggle_benchmarks/kaggle/` (model loading, serialization)
- **UI components** → `src/kaggle_benchmarks/ui/` (Panel-based rendering)

### Testing & Quality
- **Test fixtures** → `tests/conftest.py`
- **Pre-commit hooks** → `.pre-commit-config.yaml` (ruff, addlicense)
- **Type checking config** → `pyproject.toml` `[tool.mypy]` section

### Documentation
- **User-facing guides** → `documentation/quick_start.qmd`, `documentation/user_guide.qmd`
- **Example code** → `documentation/examples/*.py`

## Critical Implementation Rules

1. **Never manually edit generated files** - Files matching `**/*_pb2.py` are auto-generated from `protos/`. Run `cd protos && ./build.sh` to regenerate.
2. **Use `uv` for all dependency operations** - Never use `pip` directly. Commands: `uv pip install ...`, `uv run --group <group> <command>`. Dependency groups are defined in `pyproject.toml`.

## Operational Commands

### Setup
```bash
# Create virtual environment and install
uv venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install -e .

# Install dev dependencies
uv pip install -e ".[dev]"
```

### Testing
```bash
# Run all tests
uv run --group test pytest tests

# Run specific test file
uv run --group test pytest tests/test_assertions.py

# Run with verbose output
uv run --group test pytest tests -v
```

### Code Quality
```bash
# Format code
ruff format .

# Lint and auto-fix
ruff check --fix .

# Type check
mypy src/

# Run all pre-commit hooks
pre-commit run --all-files
```

### Protocol Buffers
```bash
# Rebuild protobuf definitions (required after editing protos/)
cd protos && ./build.sh
```

### Docker
```bash
# Build, run, or start Jupyter
cd cicd
./build.sh    # Build image
./run.sh      # Run container
./jupyter.sh  # Start Jupyter server
```