autoresearch

Autonomous skill improvement for Claude Code plugins. Inspired by Andrej Karpathy's autoresearch — an autonomous improvement loop where AI agents iterate on artifacts while humans sleep. Point it at a skill, walk away, come back to a better skill.

[Figure: autoresearch, the autonomous skill improvement loop]

Autoresearch runs an improvement loop: modify the skill, evaluate it against fixed evals, keep improvements, discard regressions. Repeat until convergence. No babysitting required.

Install

claude plugins add ./

Quick Start

# Improve a skill automatically
/autoresearch path/to/my-skill

# Create evals for a skill that has none
/autoresearch --eval-doctor path/to/my-skill

# Review results from a previous run
/autoresearch --report path/to/my-skill-autoresearch

How It Works

flowchart TD
    A[Snapshot Baseline] --> B[Evaluate Baseline]
    B --> C[Improve Candidate]
    C --> D[Evaluate Candidate]
    D --> E{Score > Best?}
    E -->|Yes| F[Keep — Snapshot v·N]
    E -->|No| G[Revert — Restore Best]
    F --> H{Stop?}
    G --> H
    H -->|Perfect score ·or· 3 reverts ·or· max iters| I[Convergence Report]
    H -->|Continue| C
    I --> J[Show Diff]
    J --> K{Apply to Original?}
    K -->|Yes| L[Restore Best → Skill]
    K -->|No| M[Changes Stay in Workspace]

    style A fill:#6366f1,color:#fff,stroke:#4f46e5
    style B fill:#f1f5f9,stroke:#94a3b8
    style C fill:#6366f1,color:#fff,stroke:#4f46e5
    style D fill:#f1f5f9,stroke:#94a3b8
    style E fill:#fef3c7,stroke:#f59e0b
    style F fill:#d1fae5,stroke:#10b981
    style G fill:#fee2e2,stroke:#ef4444
    style H fill:#fef3c7,stroke:#f59e0b
    style I fill:#ede9fe,stroke:#8b5cf6
    style J fill:#ede9fe,stroke:#8b5cf6
    style K fill:#fef3c7,stroke:#f59e0b
    style L fill:#d1fae5,stroke:#10b981
    style M fill:#f1f5f9,stroke:#94a3b8

Three Modes

1. Full Improvement Loop (default)

/autoresearch path/to/my-skill
/autoresearch path/to/my-skill --iterations 8

Runs the complete cycle: snapshot baseline, evaluate, improve, evaluate, keep/discard, repeat. Produces a convergence report and asks whether to apply the best version.
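
As a rough illustration of the cycle described above, here is a minimal Python sketch. It is not the plugin's implementation: `improvement_loop`, `evaluate`, `improve`, and the toy scoring are hypothetical stand-ins, and the stop rule follows the How It Works flowchart (perfect score, 3 reverts, or max iterations), with "3 reverts" assumed here to mean three consecutive reverts.

```python
from typing import Callable, List, Tuple

History = List[Tuple[int, float, str]]


def improvement_loop(
    baseline: str,
    evaluate: Callable[[str], float],   # hypothetical: scores a skill in [0, 1]
    improve: Callable[[str], str],      # hypothetical: proposes a modified skill
    max_iters: int = 8,                 # arbitrary default for this sketch
    max_reverts: int = 3,
    perfect: float = 1.0,
) -> Tuple[str, float, History]:
    """Keep a candidate only if it strictly beats the best score so far."""
    best, best_score = baseline, evaluate(baseline)
    history: History = [(0, best_score, "baseline")]
    reverts = 0  # assumption: counts consecutive reverts, reset on every keep

    for i in range(1, max_iters + 1):
        if best_score >= perfect or reverts >= max_reverts:
            break
        candidate = improve(best)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
            reverts = 0
            history.append((i, score, "keep"))
        else:
            # Discard the candidate; the next iteration starts from `best` again.
            reverts += 1
            history.append((i, score, "revert"))
    return best, best_score, history


# Toy stand-ins: a "skill" is a string, scored by how many keywords it mentions.
REQUIRED = ["inputs", "outputs", "examples"]


def toy_evaluate(skill: str) -> float:
    return sum(word in skill for word in REQUIRED) / len(REQUIRED)


def toy_improve(skill: str) -> str:
    missing = [word for word in REQUIRED if word not in skill]
    return f"{skill} {missing[0]}" if missing else skill


best, best_score, history = improvement_loop(
    "describe inputs", toy_evaluate, toy_improve
)

# A stalled run: the candidate never beats the baseline, so the loop stops
# once the revert budget is used up.
_, _, stalled = improvement_loop("x", lambda s: 0.5, lambda s: s)
```

In the toy run the loop keeps two candidates and then stops on a perfect score; in the stalled run it records three reverts and stops well before `max_iters`.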

2. Eval Doctor

/autoresearch --eval-doctor path/to/my-skill

Creates or fixes evaluation cases for a skill. Run this first when a skill has no evals/evals.json, or when its existing evals are too easy or too hard. This mode does not run the improvement loop.
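
A quick way to decide whether a skill needs this mode is to check for evals/evals.json before starting a run. The helper below is a hypothetical sketch, not part of the plugin; it only tests that the file exists and parses as non-empty JSON, and it makes no assumption about the actual eval schema (see the Eval Schema reference for that).

```python
import json
import tempfile
from pathlib import Path


def needs_eval_doctor(skill_dir) -> bool:
    """Return True when the skill has no usable evals/evals.json.

    Hypothetical helper: "usable" here only means the file exists and
    contains non-empty, parseable JSON.
    """
    evals_path = Path(skill_dir) / "evals" / "evals.json"
    if not evals_path.is_file():
        return True
    try:
        data = json.loads(evals_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return True
    return not data  # an empty list or object counts as "no evals"


# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "my-skill"
    skill.mkdir()
    before = needs_eval_doctor(skill)  # no evals/evals.json yet

    (skill / "evals").mkdir()
    # Placeholder content only; this is NOT the real evals.json schema.
    (skill / "evals" / "evals.json").write_text(
        '[{"placeholder": true}]', encoding="utf-8"
    )
    after = needs_eval_doctor(skill)
```

This check cannot tell whether evals are too easy or too hard; that judgment still needs a run of the eval doctor itself.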

3. Report

/autoresearch --report path/to/my-skill-autoresearch

Generates a convergence report from an existing workspace. Useful for reviewing results after a run completes.
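
To give a feel for what such a report condenses, the sketch below summarizes a tab-separated results log. The column names (`iteration`, `score`, `status`) and the sample rows are made up for illustration; the real results.tsv layout is documented in the File Formats reference and may differ.

```python
import csv
import io

# Made-up sample in an ASSUMED layout; not output from a real run.
SAMPLE_TSV = (
    "iteration\tscore\tstatus\n"
    "0\t0.50\tbaseline\n"
    "1\t0.70\tkeep\n"
    "2\t0.60\trevert\n"
    "3\t0.90\tkeep\n"
)


def summarize(tsv_text: str) -> dict:
    """Condense a results log into the headline numbers of a run."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    best_row = max(rows, key=lambda row: float(row["score"]))
    return {
        "iterations": len(rows) - 1,  # the baseline row is not an iteration
        "baseline": float(rows[0]["score"]),
        "best": float(best_row["score"]),
        "best_iteration": int(best_row["iteration"]),
        "kept": sum(row["status"] == "keep" for row in rows),
        "reverted": sum(row["status"] == "revert" for row in rows),
    }


summary = summarize(SAMPLE_TSV)
```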

Architecture

[Figure: autoresearch architecture diagram]

Documentation

Tutorials — Learn by doing

How-To Guides — Solve specific problems

Reference — Look up details

  • CLI Reference — Complete command reference
  • Algorithm — Formal loop specification
  • File Formats — results.tsv, workspace layout, snapshot format
  • Eval Schema — evals.json and trigger-eval.json schemas
  • Agents — Agent specs: improver, eval-doctor, convergence-reporter
  • Scripts — Script API: snapshot.py, score.py, results_log.py, diff_report.py

Explanation — Understand the design

Requirements

  • Claude Code with plugin support
  • The skill-creator plugin (provides the grader agent)
  • Python 3.8+ (for snapshot, scoring, and reporting scripts)

License

See LICENSE.
