Autonomous skill improvement for Claude Code plugins. Inspired by Andrej Karpathy's autoresearch — an autonomous improvement loop where AI agents iterate on artifacts while humans sleep. Point it at a skill, walk away, come back to a better skill.
Autoresearch runs an improvement loop: modify the skill, evaluate it against fixed evals, keep improvements, discard regressions. Repeat until convergence. No babysitting required.
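The loop can be sketched in a few lines of Python. This is purely illustrative: the helpers (`snapshot`, `evaluate`, `improve`, `revert`) and the default limits are hypothetical stand-ins for the plugin's actual agents and scripts, not its API.

```python
def autoresearch(skill, evaluate, improve, snapshot, revert,
                 max_iters=8, max_reverts=3):
    """Hill-climb a skill against a fixed eval set (illustrative sketch)."""
    snapshot(skill, "baseline")          # snapshot baseline
    best = evaluate(skill)               # evaluate baseline
    reverts = 0
    for i in range(1, max_iters + 1):
        improve(skill)                   # propose a candidate
        score = evaluate(skill)          # evaluate candidate
        if score > best:                 # improvement: keep, snapshot vN
            best = score
            snapshot(skill, f"v{i}")
            reverts = 0
        else:                            # regression: restore best version
            revert(skill)
            reverts += 1
        # stop on perfect score, repeated reverts, or iteration budget
        if best == 1.0 or reverts >= max_reverts:
            break
    return best
```

Only strictly better candidates are kept, so the best-known version never regresses between iterations.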
```shell
claude plugins add ./
```

```shell
# Improve a skill automatically
/autoresearch path/to/my-skill

# Create evals for a skill that has none
/autoresearch --eval-doctor path/to/my-skill

# Review results from a previous run
/autoresearch --report path/to/my-skill-autoresearch
```

```mermaid
flowchart TD
    A[Snapshot Baseline] --> B[Evaluate Baseline]
    B --> C[Improve Candidate]
    C --> D[Evaluate Candidate]
    D --> E{Score > Best?}
    E -->|Yes| F[Keep — Snapshot vN]
    E -->|No| G[Revert — Restore Best]
    F --> H{Stop?}
    G --> H
    H -->|Perfect score or 3 reverts or max iters| I[Convergence Report]
    H -->|Continue| C
    I --> J[Show Diff]
    J --> K{Apply to Original?}
    K -->|Yes| L[Restore Best → Skill]
    K -->|No| M[Changes Stay in Workspace]

    style A fill:#6366f1,color:#fff,stroke:#4f46e5
    style B fill:#f1f5f9,stroke:#94a3b8
    style C fill:#6366f1,color:#fff,stroke:#4f46e5
    style D fill:#f1f5f9,stroke:#94a3b8
    style E fill:#fef3c7,stroke:#f59e0b
    style F fill:#d1fae5,stroke:#10b981
    style G fill:#fee2e2,stroke:#ef4444
    style H fill:#fef3c7,stroke:#f59e0b
    style I fill:#ede9fe,stroke:#8b5cf6
    style J fill:#ede9fe,stroke:#8b5cf6
    style K fill:#fef3c7,stroke:#f59e0b
    style L fill:#d1fae5,stroke:#10b981
    style M fill:#f1f5f9,stroke:#94a3b8
```
```shell
/autoresearch path/to/my-skill
/autoresearch path/to/my-skill --iterations 8
```

Runs the complete cycle: snapshot baseline, evaluate, improve, evaluate, keep/discard, repeat. Produces a convergence report and asks whether to apply the best version.
```shell
/autoresearch --eval-doctor path/to/my-skill
```

Creates or fixes evaluation cases for a skill. Run this first when a skill has no `evals/evals.json`, or when its evals are too easy or too hard. Does not run the improvement loop.
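For orientation, an eval file might look something like the following. The shape shown here is invented for illustration, not the plugin's actual format; see the Eval Schema reference for the real `evals.json` and `trigger-eval.json` schemas.

```json
{
  "cases": [
    {
      "id": "basic-trigger",
      "prompt": "Summarize this changelog into release notes",
      "expected": "Response follows the skill's release-notes structure"
    }
  ]
}
```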
```shell
/autoresearch --report path/to/my-skill-autoresearch
```

Generates a convergence report from an existing workspace. Useful for reviewing results after a run completes.
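A workspace's `results.tsv` can also be inspected directly. The column names below (`iteration`, `score`, `decision`) are an assumption for illustration; the actual layout is documented in the File Formats reference.

```python
import csv

def best_iteration(path):
    """Return the (iteration, score) pair with the highest score
    from a results.tsv log (assumed tab-separated with a header)."""
    with open(path, newline="") as f:
        rows = [(row["iteration"], float(row["score"]))
                for row in csv.DictReader(f, delimiter="\t")]
    return max(rows, key=lambda r: r[1])
```

Because scores are logged per iteration, the best version is simply the highest-scoring row.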
- Getting Started — Your first autoresearch loop
- Creating Evals from Scratch — Building evals for a bare skill
- Improving an Existing Skill — Taking a skill from 65% to 90%+
- Run the Improvement Loop — Execute the core loop with all options
- Manage Evals — Create, fix, update eval cases
- Interpret Results — Read `results.tsv` and convergence reports
- Customize Iterations — Change max iterations, understand abort thresholds
- Apply Changes — Review and apply the best version
- Recover from Failure — Resume after interruption, inspect snapshots
- Integrate with Skill Creator — Post-loop description optimization
- CLI Reference — Complete command reference
- Algorithm — Formal loop specification
- File Formats — `results.tsv`, workspace layout, snapshot format
- Eval Schema — `evals.json` and `trigger-eval.json` schemas
- Agents — Agent specs: improver, eval-doctor, convergence-reporter
- Scripts — Script API: `snapshot.py`, `score.py`, `results_log.py`, `diff_report.py`
- The Autoresearch Pattern — Karpathy's pattern and how it maps to skills
- Eval-Skill Separation — Why evals and skills improve separately
- Convergence and Scoring — How scoring works, what convergence means
- Lifecycle — Full lifecycle from start to finish
- Component Architecture — How orchestrator, agents, and scripts interact
- Expected Results — Typical score trajectories and failure modes
- Claude Code with plugin support
- The `skill-creator` plugin (provides the grader agent)
- Python 3.8+ (for snapshot, scoring, and reporting scripts)
See LICENSE.