dalek-lean-bench

A benchmark that simulates the real-world software verification workflow by replaying the historical order in which theorems were proven in the curve25519-dalek Lean verification project.


Motivation

Most theorem-proving benchmarks evaluate models on isolated problems. Real verification work is different: a verifier always has a growing body of already-proven lemmas to build on, and the difficulty of each new theorem depends on what has already been established.

This benchmark captures that structure. It uses the commit-by-commit history of a real, ongoing Lean 4 verification project — formally verifying curve25519-dalek, a widely-used Rust implementation of elliptic curve cryptography. Each entry in the benchmark corresponds to a theorem that was proven at a specific point in the project's development. When evaluating a model on theorem B at timeline position i:

  • Theorems 0 … i-1 (proven before B) retain their full proofs — available as established lemmas.
  • Theorem B itself has sorry injected — this is the target the model must prove.
  • Theorems i+1 … N (proven after B) also have sorry injected — simulating that they do not yet exist.

This mirrors the proof context that was actually available when B was originally written, making it a far more realistic evaluation setting for agentic theorem proving than isolated-problem benchmarks.
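Concretely, the split at timeline position i can be sketched as below. This is an illustrative sketch only; the actual injection logic lives in scripts/eval_timeline.py and its internals may differ.

```python
# Illustrative sketch of the timeline split around target index i.

def split_timeline(theorems, i):
    """Partition a chronologically ordered theorem list around target i."""
    context = theorems[:i]      # proofs kept intact: usable as lemmas
    target = theorems[i]        # proof replaced by `sorry`: the goal
    future = theorems[i + 1:]   # proofs replaced by `sorry`: "not yet proven"
    return context, target, future

# Hypothetical theorem names, for illustration only.
timeline = ["sub", "neg", "add", "mul", "square"]
context, target, future = split_timeline(timeline, 2)
print(context, target, future)  # ['sub', 'neg'] add ['mul', 'square']
```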


Dataset at a Glance

Artifact                   Description
proof_timeline.csv         249 theorems in the order they were historically proven
dataset.jsonl              146 theorem–proof pairs extracted from git history
Curve25519Dalek/Specs/     ~150 hand-written Lean 4 spec files (formal statements)
Curve25519Dalek/Funs.lean  Auto-generated Lean translation of Rust via Aeneas

The underlying Lean project covers the full curve25519-dalek library: field arithmetic over GF(2²⁵⁵−19), scalar arithmetic, Edwards/Montgomery/Ristretto group operations, and scalar multiplication.
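The timeline CSV can be loaded with Python's standard csv module. The column names below ("id", "lean_name") are hypothetical, chosen only for illustration — inspect the actual header of proof_timeline.csv before relying on them.

```python
import csv
import io

# A one-row stand-in for proof_timeline.csv; real column names may differ.
sample = io.StringIO(
    "id,lean_name\n"
    "tl_0007_add,curve25519_dalek.backend.serial.u64.field.FieldElement51.add\n"
)
rows = list(csv.DictReader(sample))
print(rows[0]["id"])  # tl_0007_add
```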


How It Works

curve25519-dalek (Rust)
        │
        │  Aeneas (Rust → Lean extraction)
        ▼
Curve25519Dalek/Funs.lean          ← auto-generated, never edited
        │
        │  Human/LLM writes formal specs + proofs
        ▼
Curve25519Dalek/Specs/**/*.lean    ← one file per Rust function
        │
        │  git history records when each proof was completed
        ▼
proof_timeline.csv                 ← 249-entry ordered timeline
        │
        │  eval_timeline.py: replay history, inject sorry, run agent
        ▼
results/*.jsonl                    ← pass/fail per theorem, per model

Evaluation loop (per theorem)

  1. Create an isolated git worktree at HEAD.
  2. Inject sorry into the target theorem and all theorems proven after it.
  3. Run the LLM agent (e.g., Claude Code CLI) with a budget cap; the agent edits the spec file and runs lake build iteratively.
  4. Record pass/fail, agent wall time, token cost, and build output.
  5. Clean up the worktree.
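Steps 1 and 5 of the loop above amount to a git worktree lifecycle, sketched below. The helper name and temp path are hypothetical; the real harness in scripts/eval_timeline.py handles this internally.

```python
# Simplified sketch of the per-theorem worktree lifecycle (steps 1 and 5).
# The function name and /tmp path are illustrative, not the harness's own.

def worktree_lifecycle(entry_id: str):
    path = f"/tmp/bench-{entry_id}"
    setup = ["git", "worktree", "add", "--detach", path, "HEAD"]
    # ... between these: inject `sorry`, run the agent, `lake build` in `path` ...
    teardown = ["git", "worktree", "remove", "--force", path]
    return setup, teardown

setup, teardown = worktree_lifecycle("tl_0007_add")
print(" ".join(setup))
```

Using `--detach` keeps the worktree off any branch, so parallel evaluations never contend for the same ref.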

Quick Start

Prerequisites

# Lean 4 / Lake (see lean-toolchain for exact version)
curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh

# Build all Lean dependencies once (takes ~10–30 min on first run)
lake build

# Claude Code CLI (required for the default agent)
npm install -g @anthropic-ai/claude-code
export ANTHROPIC_API_KEY=sk-...

Run the timeline evaluation

# Smoke-test: first 5 timeline entries
python scripts/eval_timeline.py --limit 5

# Single theorem by ID
python scripts/eval_timeline.py --ids tl_0007_add

# Full run (parallel, resumable)
python scripts/eval_timeline.py --output results/run-$(date +%Y%m%d).jsonl

# Resume an interrupted run
python scripts/eval_timeline.py --output results/run-20260101.jsonl --resume

# Dry-run (no API calls, no lake build — just inspect worktree setup)
python scripts/eval_timeline.py --dry-run --setup-only --limit 3 --keep-worktree

# Stream agent output live
python scripts/eval_timeline.py --live --limit 2

Explore the timeline

# List all 249 evaluation IDs with theorem names
python scripts/eval_timeline.py --list

Repository Structure

.
├── Curve25519Dalek/
│   ├── Funs.lean              # Auto-generated Lean translation of Rust (Aeneas output)
│   ├── Types.lean             # Auto-generated type definitions
│   ├── Specs/                 # Formal specifications (one file per Rust function)
│   ├── Math/                  # Supporting math: Edwards, Montgomery, Ristretto
│   ├── Aux.lean               # Shared auxiliary lemmas
│   └── Tactics.lean           # Custom Lean tactics
├── curve25519-dalek/          # Rust source (git submodule, minimal modifications)
├── scripts/
│   ├── eval_timeline.py       # Main benchmark harness (timeline-ordered evaluation)
│   ├── eval.py                # Basic single-shot LLM evaluation
│   ├── eval_claude_code.py    # Agentic evaluation (head / commit-before modes)
│   ├── gen_proof_timeline.py  # Regenerate proof_timeline.csv from git history
│   └── extract_dataset.py     # Extract dataset.jsonl from git history
├── proof_timeline.csv         # 249-entry ordered proof history
├── dataset.jsonl              # 146 theorem–proof pairs for training/eval
├── status.csv                 # Current verification status of all 197 functions
└── extraction_notes.md        # Notes on Aeneas extraction limitations

Evaluation Scripts

Script                 Purpose
eval_timeline.py       Primary benchmark: timeline-ordered agentic evaluation with sorry injection
eval_claude_code.py    Agentic evaluation in "head" or "commit-before" mode
eval.py                Single-shot LLM evaluation on dataset.jsonl entries
gen_proof_timeline.py  Rebuild proof_timeline.csv by walking git history
extract_dataset.py     Build dataset.jsonl by scanning git history for sorry→proof transitions

Output Format

Each results/*.jsonl file has one JSON object per evaluated theorem:

{
  "id": "tl_0007_add",
  "lean_name": "curve25519_dalek.backend.serial.u64.field.FieldElement51.add",
  "spec_theorem": "Curve25519Dalek/Specs/Backend/.../Add.lean",
  "success": true,
  "agent_time_s": 47.2,
  "build_stdout": "...",
  "build_stderr": "",
  "timestamp": "2026-04-13T10:00:00Z"
}
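Because each line is a standalone JSON object, results can be aggregated with a few lines of Python. The field names below match the schema above; the two sample records are illustrative.

```python
import json

# Two sample records following the documented schema (values illustrative).
lines = [
    '{"id": "tl_0007_add", "success": true, "agent_time_s": 47.2}',
    '{"id": "tl_0008_sub", "success": false, "agent_time_s": 120.0}',
]
records = [json.loads(line) for line in lines]
passed = sum(r["success"] for r in records)
print(f"{passed}/{len(records)} theorems proven")  # 1/2 theorems proven
```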

Underlying Verification Project

The benchmark is built on top of the curve25519-dalek Lean verification project. That project:

  • Uses Aeneas to automatically translate Rust to Lean 4
  • Follows the Aeneas WP (weakest precondition) style for specs
  • Has verified 197 Rust functions from curve25519-dalek as of the initial benchmark snapshot
  • Is maintained by Oliver Butterley and The Beneficial AI Foundation
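A WP-style Aeneas spec typically states that the extracted function succeeds (returns ok) with a value satisfying a mathematical property. The sketch below shows the general shape only; the identifiers and statement are hypothetical, not taken from this repository's Specs/ files.

```lean
-- Hypothetical shape of an Aeneas WP-style spec (identifiers illustrative):
-- the extracted addition succeeds and its result matches the math-level sum
-- modulo the field prime 2^255 - 19.
theorem FieldElement51.add_spec (a b : FieldElement51) :
    ∃ c, FieldElement51.add a b = ok c ∧
      c.val = (a.val + b.val) % (2 ^ 255 - 19) := by
  sorry
```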

Dependencies

Dependency               Role
Lean 4 (lean-toolchain)  Proof checker and build system
Aeneas                   Rust → Lean extraction (Lean package)
Mathlib                  Mathematical library
PrimeCert                Cryptographic certificates
Claude Code CLI          Default LLM agent for eval_timeline.py
Python 3.7+              Evaluation harness scripts

License

The Lean verification code is licensed under the Apache License 2.0.

The curve25519-dalek Rust source (in curve25519-dalek/) is dual-licensed under Apache 2.0 or MIT.
