Autonomous skill improvement for Claude Code plugins. Inspired by Andrej Karpathy's autoresearch — an autonomous improvement loop where AI agents iterate on artifacts while humans sleep. Point it at a skill, walk away, come back to a better skill.
Autoresearch runs an improvement loop: modify the skill, evaluate it against fixed evals, keep improvements, discard regressions. Repeat until convergence. No babysitting required.
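The loop can be sketched in a few lines of Python. This is purely illustrative: the helpers (`snapshot`, `evaluate`, `improve`, `revert`) and the default limits are hypothetical stand-ins for the plugin's actual agents and scripts, not its API.

```python
def autoresearch(skill, evaluate, improve, snapshot, revert,
                 max_iters=8, max_reverts=3):
    """Hill-climb a skill against a fixed eval set (illustrative sketch)."""
    snapshot(skill, "baseline")          # snapshot baseline
    best = evaluate(skill)               # evaluate baseline
    reverts = 0
    for i in range(1, max_iters + 1):
        improve(skill)                   # propose a candidate
        score = evaluate(skill)          # evaluate candidate
        if score > best:                 # improvement: keep, snapshot vN
            best = score
            snapshot(skill, f"v{i}")
            reverts = 0
        else:                            # regression: restore best version
            revert(skill)
            reverts += 1
        # stop on perfect score, repeated reverts, or iteration budget
        if best == 1.0 or reverts >= max_reverts:
            break
    return best
```

Only strictly better candidates are kept, so the best-known version never regresses between iterations.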
```shell
claude plugins add ./
```

```shell
# Improve a skill automatically
/autoresearch path/to/my-skill

# Create evals for a skill that has none
/autoresearch --eval-doctor path/to/my-skill

# Review results from a previous run
/autoresearch --report path/to/my-skill-autoresearch
```

```mermaid
flowchart TD
    A[Snapshot Baseline] --> B[Evaluate Baseline]
    B --> C[Improve Candidate]
    C --> D[Evaluate Candidate]
    D --> E{Score > Best?}
    E -->|Yes| F[Keep — Snapshot vN]
    E -->|No| G[Revert — Restore Best]
    F --> H{Stop?}
    G --> H
    H -->|Perfect score or 3 reverts or max iters| I[Convergence Report]
    H -->|Continue| C
    I --> J[Show Diff]
    J --> K{Apply to Original?}
    K -->|Yes| L[Restore Best → Skill]
    K -->|No| M[Changes Stay in Workspace]

    style A fill:#6366f1,color:#fff,stroke:#4f46e5
    style B fill:#f1f5f9,stroke:#94a3b8
    style C fill:#6366f1,color:#fff,stroke:#4f46e5
    style D fill:#f1f5f9,stroke:#94a3b8
    style E fill:#fef3c7,stroke:#f59e0b
    style F fill:#d1fae5,stroke:#10b981
    style G fill:#fee2e2,stroke:#ef4444
    style H fill:#fef3c7,stroke:#f59e0b
    style I fill:#ede9fe,stroke:#8b5cf6
    style J fill:#ede9fe,stroke:#8b5cf6
    style K fill:#fef3c7,stroke:#f59e0b
    style L fill:#d1fae5,stroke:#10b981
    style M fill:#f1f5f9,stroke:#94a3b8
```
```shell
/autoresearch path/to/my-skill
/autoresearch path/to/my-skill --iterations 8
```

Runs the complete cycle: snapshot baseline, evaluate, improve, evaluate, keep/discard, repeat. Produces a convergence report and asks whether to apply the best version.
```shell
/autoresearch --eval-doctor path/to/my-skill
```

Creates or fixes evaluation cases for a skill. Run this first when a skill has no `evals/evals.json`, or when its evals are too easy or too hard. Does not run the improvement loop.
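For orientation, an eval file might look something like the following. The shape shown here is invented for illustration, not the plugin's actual format; see the Eval Schema reference for the real `evals.json` and `trigger-eval.json` schemas.

```json
{
  "cases": [
    {
      "id": "basic-trigger",
      "prompt": "Summarize this changelog into release notes",
      "expected": "Response follows the skill's release-notes structure"
    }
  ]
}
```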
```shell
/autoresearch --report path/to/my-skill-autoresearch
```

Generates a convergence report from an existing workspace. Useful for reviewing results after a run completes.
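A workspace's `results.tsv` can also be inspected directly. The column names below (`iteration`, `score`, `decision`) are an assumption for illustration; the actual layout is documented in the File Formats reference.

```python
import csv

def best_iteration(path):
    """Return the (iteration, score) pair with the highest score
    from a results.tsv log (assumed tab-separated with a header)."""
    with open(path, newline="") as f:
        rows = [(row["iteration"], float(row["score"]))
                for row in csv.DictReader(f, delimiter="\t")]
    return max(rows, key=lambda r: r[1])
```

Because scores are logged per iteration, the best version is simply the highest-scoring row.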
- Getting Started — Your first autoresearch loop
- Creating Evals from Scratch — Building evals for a bare skill
- Improving an Existing Skill — Taking a skill from 65% to 90%+
- Run the Improvement Loop — Execute the core loop with all options
- Manage Evals — Create, fix, update eval cases
- Interpret Results — Read `results.tsv` and convergence reports
- Customize Iterations — Change max iterations, understand abort thresholds
- Apply Changes — Review and apply the best version
- Recover from Failure — Resume after interruption, inspect snapshots
- Integrate with Skill Creator — Post-loop description optimization
- CLI Reference — Complete command reference
- Algorithm — Formal loop specification
- File Formats — `results.tsv`, workspace layout, snapshot format
- Eval Schema — `evals.json` and `trigger-eval.json` schemas
- Agents — Agent specs: improver, eval-doctor, convergence-reporter
- Scripts — Script API: `snapshot.py`, `score.py`, `results_log.py`, `diff_report.py`
- The Autoresearch Pattern — Karpathy's pattern and how it maps to skills
- Eval-Skill Separation — Why evals and skills improve separately
- Convergence and Scoring — How scoring works, what convergence means
- Lifecycle — Full lifecycle from start to finish
- Component Architecture — How orchestrator, agents, and scripts interact
- Expected Results — Typical score trajectories and failure modes
- Claude Code with plugin support
- The `skill-creator` plugin (provides the grader agent)
- Python 3.8+ (for snapshot, scoring, and reporting scripts)
See LICENSE.