
feat: add CLI analysis tool for experiment results #475

Open
ravyg wants to merge 2 commits into karpathy:master from ravyg:feat-analysis-cli

Conversation


@ravyg ravyg commented Apr 3, 2026

Summary

  • Adds analysis.py — a CLI version of analysis.ipynb that the autonomous agent can call between experiments for structured feedback
  • Supports three output modes: human-readable text report, machine-readable JSON (--json), and progress chart (--plot)
  • Uses only existing dependencies (pandas, numpy, matplotlib) — no new packages

Closes #476

Motivation

The agent logs results to results.tsv but has no programmatic way to analyze them; today it must grep and reason over the raw TSV by hand. This script lets the agent run:

uv run analysis.py --json

...and get a structured summary of what's working, what's not, and whether progress is plateauing — directly informing its next experiment choice.

Usage

uv run analysis.py                          # text report to stdout
uv run analysis.py --json                   # machine-readable JSON
uv run analysis.py --plot progress.png      # save progress chart
uv run analysis.py --tsv path/to/results.tsv  # custom TSV path
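The PR page doesn't include the script body, but the entry point presumably looks something like the sketch below. Only the flag names (--json, --plot, --tsv) and the helper names load_results / compute_stats / print_text_report / save_plot come from this PR; everything else is an illustrative guess:

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the usage examples above; the defaults are assumptions.
    parser = argparse.ArgumentParser(description="Analyze autoresearch experiment results")
    parser.add_argument("--tsv", default="results.tsv", help="path to the results TSV")
    parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
    parser.add_argument("--plot", metavar="PNG", help="save a progress chart to this file")
    return parser


def main() -> None:
    args = build_parser().parse_args()
    # load_results / compute_stats / print_text_report / save_plot are the
    # helpers named in the commit messages; they live elsewhere in analysis.py.
    df = load_results(args.tsv)
    stats = compute_stats(df)
    if args.json:
        print(json.dumps(stats, indent=2))
    else:
        print_text_report(stats)
    if args.plot:
        save_plot(df, args.plot)
```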

Example text output

============================================================
AUTORESEARCH EXPERIMENT REPORT
============================================================

Total experiments:  7
  Kept:             4
  Discarded:        2
  Crashed:          1
  Keep rate:        66.7%

Baseline val_bpb:   0.997900
Best val_bpb:       0.988500
Total improvement:  0.009400 (0.94%)
Best experiment:    increase batch size to 2**20
Trajectory:         improving

Top improvements (by delta):
  Rank      Delta         BPB  Description
  --------------------------------------------------------
     1  +0.004700  0.993200  increase LR to 0.04
     2  +0.003100  0.990100  add warmup ratio 0.05
     3  +0.001600  0.988500  increase batch size to 2**20

Example JSON output

{
  "total_experiments": 7,
  "kept": 4,
  "discarded": 2,
  "crashed": 1,
  "keep_rate": 0.6667,
  "baseline_bpb": 0.9979,
  "best_bpb": 0.9885,
  "improvement": 0.0094,
  "improvement_pct": 0.94,
  "best_experiment": "increase batch size to 2**20",
  "top_hits": [...],
  "trajectory": "improving"
}
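The --json mode exists so the agent loop can consume the summary programmatically. As a rough illustration, a caller might parse the report and branch on it; the decision policy below is invented for the example and is not part of this PR:

```python
import json

def next_step_hint(report_json: str) -> str:
    # Toy policy over the analysis.py --json summary; illustrative only.
    stats = json.loads(report_json)
    if stats["trajectory"] == "improving":
        return "continue refining: " + stats["best_experiment"]
    if stats["keep_rate"] < 0.3:
        return "try a different hyperparameter family"
    return "widen the search: progress is flat"

# A trimmed-down version of the example JSON above.
sample = json.dumps({
    "trajectory": "improving",
    "keep_rate": 0.6667,
    "best_experiment": "increase batch size to 2**20",
})
print(next_step_hint(sample))  # continue refining: increase batch size to 2**20
```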

Test plan

  • Tested with sample results.tsv (7 experiments, mixed keep/discard/crash)
  • Tested --json output parses as valid JSON
  • Tested --plot generates PNG
  • Tested missing file error handling
  • Tested empty keep set edge case

Commit message:

Converts the interactive analysis.ipynb into a CLI script that the
autonomous agent can call between experiments for structured feedback.

  uv run analysis.py                     # text report
  uv run analysis.py --json              # machine-readable JSON
  uv run analysis.py --plot progress.png # save progress chart

Outputs experiment counts, keep/discard/crash rates, baseline vs best
val_bpb, top improvements ranked by delta, and a trajectory indicator
(improving/plateauing/stuck). Uses only existing dependencies (pandas,
numpy, matplotlib).

ravyg commented Apr 6, 2026

Hi @karpathy @svlandeg — this PR adds a CLI version of analysis.ipynb so the autonomous agent can get structured feedback between experiments (via uv run analysis.py --json).

Currently the agent logs to results.tsv but has no programmatic way to analyze trends or detect when it's plateauing. This script gives it that — same data as the notebook, but callable from the loop. Details in #476.

No changes to existing files, no new dependencies. Happy to adjust based on feedback.

Commit message (bb7b291):

Covers load_results, compute_stats, trajectory states, edge cases,
text report, and save_plot. Uses stdlib unittest only — no new deps.

Credit to @MohammadWasi (karpathy#495) for suggesting test coverage.

ravyg commented Apr 14, 2026

Hi @svlandeg — gentle nudge on this one.

Following your note on #495 that this PR is the canonical one, I've pushed a follow-up commit (bb7b291) adding test_analysis.py — a stdlib unittest suite (19 tests, no new deps) covering:

  • TSV parsing, status normalization, and NaN handling in load_results
  • compute_stats counts, keep-rate, baseline/best/improvement, and top-hits ordering
  • All five trajectory states: early, improving, plateauing, stuck, no_data
  • Edge cases: only-crashes, single-keep, no-keeps
  • print_text_report rendering and JSON-serializability of stats
  • save_plot file creation and no-op-when-no-keeps behavior
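For reference, the five trajectory states could come from a windowed-delta heuristic along these lines; this is purely illustrative, and the actual rule in analysis.py may differ:

```python
def classify_trajectory(bpb_history, window=3, eps=1e-4):
    # bpb_history: val_bpb of kept experiments in order; lower is better.
    # The window size and eps threshold are made-up defaults for this sketch.
    if not bpb_history:
        return "no_data"
    if len(bpb_history) < window + 1:
        return "early"
    recent = bpb_history[-(window + 1):]
    deltas = [prev - cur for prev, cur in zip(recent, recent[1:])]  # positive = improvement
    if sum(deltas) > eps:
        return "improving"
    if any(d > eps for d in deltas):
        return "plateauing"
    return "stuck"
```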

Credit to @MohammadWasi for suggesting test coverage in #495.

Happy to adjust scope, style, or split tests out if you'd prefer a different layout. Let me know if there's anything else blocking on my end.


Development

Successfully merging this pull request may close these issues.

feat: CLI analysis tool for experiment results — structured feedback for the autonomous agent
