
feat: add CLI analysis tool for experiment results #475

Open
ravyg wants to merge 2 commits into karpathy:master from ravyg:feat-analysis-cli

Conversation


@ravyg ravyg commented Apr 3, 2026

Summary

  • Adds analysis.py — a CLI version of analysis.ipynb that the autonomous agent can call between experiments for structured feedback
  • Supports three output modes: human-readable text report, machine-readable JSON (--json), and progress chart (--plot)
  • Uses only existing dependencies (pandas, numpy, matplotlib) — no new packages

Closes #476

Motivation

The agent logs results to results.tsv but has no programmatic way to analyze them; today it must grep and reason over the raw TSV by hand. This script lets the agent run:

uv run analysis.py --json

...and get a structured summary of what's working, what's not, and whether progress is plateauing — directly informing its next experiment choice.

Usage

uv run analysis.py                          # text report to stdout
uv run analysis.py --json                   # machine-readable JSON
uv run analysis.py --plot progress.png      # save progress chart
uv run analysis.py --tsv path/to/results.tsv  # custom TSV path
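The PR page doesn't include the script body, but the entry point presumably looks something like the sketch below. Only the flag names (--json, --plot, --tsv) and the helper names load_results / compute_stats / print_text_report / save_plot come from this PR; everything else is an illustrative guess:

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the usage examples above; the defaults are assumptions.
    parser = argparse.ArgumentParser(description="Analyze autoresearch experiment results")
    parser.add_argument("--tsv", default="results.tsv", help="path to the results TSV")
    parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
    parser.add_argument("--plot", metavar="PNG", help="save a progress chart to this file")
    return parser


def main() -> None:
    args = build_parser().parse_args()
    # load_results / compute_stats / print_text_report / save_plot are the
    # helpers named in the commit messages; they live elsewhere in analysis.py.
    df = load_results(args.tsv)
    stats = compute_stats(df)
    if args.json:
        print(json.dumps(stats, indent=2))
    else:
        print_text_report(stats)
    if args.plot:
        save_plot(df, args.plot)
```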

Example text output

============================================================
AUTORESEARCH EXPERIMENT REPORT
============================================================

Total experiments:  7
  Kept:             4
  Discarded:        2
  Crashed:          1
  Keep rate:        66.7%

Baseline val_bpb:   0.997900
Best val_bpb:       0.988500
Total improvement:  0.009400 (0.94%)
Best experiment:    increase batch size to 2**20
Trajectory:         improving

Top improvements (by delta):
  Rank      Delta         BPB  Description
  --------------------------------------------------------
     1  +0.004700  0.993200  increase LR to 0.04
     2  +0.003100  0.990100  add warmup ratio 0.05
     3  +0.001600  0.988500  increase batch size to 2**20

Example JSON output

{
  "total_experiments": 7,
  "kept": 4,
  "discarded": 2,
  "crashed": 1,
  "keep_rate": 0.6667,
  "baseline_bpb": 0.9979,
  "best_bpb": 0.9885,
  "improvement": 0.0094,
  "improvement_pct": 0.94,
  "best_experiment": "increase batch size to 2**20",
  "top_hits": [...],
  "trajectory": "improving"
}
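The --json mode exists so the agent loop can consume the summary programmatically. As a rough illustration, a caller might parse the report and branch on it; the decision policy below is invented for the example and is not part of this PR:

```python
import json

def next_step_hint(report_json: str) -> str:
    # Toy policy over the analysis.py --json summary; illustrative only.
    stats = json.loads(report_json)
    if stats["trajectory"] == "improving":
        return "continue refining: " + stats["best_experiment"]
    if stats["keep_rate"] < 0.3:
        return "try a different hyperparameter family"
    return "widen the search: progress is flat"

# A trimmed-down version of the example JSON above.
sample = json.dumps({
    "trajectory": "improving",
    "keep_rate": 0.6667,
    "best_experiment": "increase batch size to 2**20",
})
print(next_step_hint(sample))  # continue refining: increase batch size to 2**20
```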

Test plan

  • Tested with sample results.tsv (7 experiments, mixed keep/discard/crash)
  • Tested --json output parses as valid JSON
  • Tested --plot generates PNG
  • Tested missing file error handling
  • Tested empty keep set edge case

Commit message:

Converts the interactive analysis.ipynb into a CLI script that the
autonomous agent can call between experiments for structured feedback.

  uv run analysis.py                     # text report
  uv run analysis.py --json              # machine-readable JSON
  uv run analysis.py --plot progress.png # save progress chart

Outputs experiment counts, keep/discard/crash rates, baseline vs best
val_bpb, top improvements ranked by delta, and a trajectory indicator
(improving/plateauing/stuck). Uses only existing dependencies (pandas,
numpy, matplotlib).

ravyg commented Apr 6, 2026

Hi @karpathy @svlandeg — this PR adds a CLI version of analysis.ipynb so the autonomous agent can get structured feedback between experiments (via uv run analysis.py --json).

Currently the agent logs to results.tsv but has no programmatic way to analyze trends or detect when it's plateauing. This script gives it that — same data as the notebook, but callable from the loop. Details in #476.

No changes to existing files, no new dependencies. Happy to adjust based on feedback.

Commit message (bb7b291):

Covers load_results, compute_stats, trajectory states, edge cases,
text report, and save_plot. Uses stdlib unittest only — no new deps.

Credit to @MohammadWasi (karpathy#495) for suggesting test coverage.

ravyg commented Apr 14, 2026

Hi @svlandeg — gentle nudge on this one.

Following your note on #495 that this PR is the canonical one, I've pushed a follow-up commit (bb7b291) adding test_analysis.py — a stdlib unittest suite (19 tests, no new deps) covering:

  • TSV parsing, status normalization, and NaN handling in load_results
  • compute_stats counts, keep-rate, baseline/best/improvement, and top-hits ordering
  • All five trajectory states: early, improving, plateauing, stuck, no_data
  • Edge cases: only-crashes, single-keep, no-keeps
  • print_text_report rendering and JSON-serializability of stats
  • save_plot file creation and no-op-when-no-keeps behavior
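For reference, the five trajectory states could come from a windowed-delta heuristic along these lines; this is purely illustrative, and the actual rule in analysis.py may differ:

```python
def classify_trajectory(bpb_history, window=3, eps=1e-4):
    # bpb_history: val_bpb of kept experiments in order; lower is better.
    # The window size and eps threshold are made-up defaults for this sketch.
    if not bpb_history:
        return "no_data"
    if len(bpb_history) < window + 1:
        return "early"
    recent = bpb_history[-(window + 1):]
    deltas = [prev - cur for prev, cur in zip(recent, recent[1:])]  # positive = improvement
    if sum(deltas) > eps:
        return "improving"
    if any(d > eps for d in deltas):
        return "plateauing"
    return "stuck"
```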

Credit to @MohammadWasi for suggesting test coverage in #495.

Happy to adjust scope, style, or split tests out if you'd prefer a different layout. Let me know if there's anything else blocking on my end.


Development

Successfully merging this pull request may close these issues.

feat: CLI analysis tool for experiment results — structured feedback for the autonomous agent
