feat: add CLI analysis tool for experiment results #475
Converts the interactive analysis.ipynb into a CLI script that the autonomous agent can call between experiments for structured feedback.

    uv run analysis.py                      # text report
    uv run analysis.py --json               # machine-readable JSON
    uv run analysis.py --plot progress.png  # save progress chart

Outputs experiment counts, keep/discard/crash rates, baseline vs best val_bpb, top improvements ranked by delta, and a trajectory indicator (improving/plateauing/stuck). Uses only existing dependencies (pandas, numpy, matplotlib).
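As a rough illustration of the summary described above, the sketch below derives the counts, keep rate, and baseline-versus-best `val_bpb` from TSV rows using only the standard library. It is not the code in `analysis.py` (which uses pandas). The column names (`val_bpb`, `status`, `description`), the status values (`keep`, `discard`, `crash`), the choice to exclude crashes from the keep-rate denominator, the "first kept row is the baseline" rule, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

# Made-up sample in an assumed results.tsv layout (hypothetical columns and values).
SAMPLE_TSV = (
    "val_bpb\tstatus\tdescription\n"
    "1.0000\tkeep\tbaseline\n"
    "0.9950\tkeep\thypothetical change A\n"
    "1.0100\tdiscard\thypothetical change B\n"
    "0.0000\tcrash\thypothetical change C\n"
    "0.9900\tkeep\thypothetical change D\n"
)


def load_rows(text):
    """Parse TSV text into a list of dicts, converting val_bpb to float."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        row["val_bpb"] = float(row["val_bpb"])
    return rows


def summarize(rows):
    """Compute counts, keep rate, and baseline vs best val_bpb (lower is better)."""
    kept = [r for r in rows if r["status"] == "keep"]
    discarded = [r for r in rows if r["status"] == "discard"]
    crashed = [r for r in rows if r["status"] == "crash"]
    decided = len(kept) + len(discarded)
    stats = {
        "total_experiments": len(rows),
        "kept": len(kept),
        "discarded": len(discarded),
        "crashed": len(crashed),
        # Assumption: crashes are excluded from the keep-rate denominator.
        "keep_rate": round(len(kept) / decided, 4) if decided else 0.0,
    }
    if kept:
        baseline = kept[0]["val_bpb"]  # assumption: first kept row is the baseline
        best = min(kept, key=lambda r: r["val_bpb"])
        stats["baseline_bpb"] = baseline
        stats["best_bpb"] = best["val_bpb"]
        stats["improvement"] = round(baseline - best["val_bpb"], 4)
        stats["best_experiment"] = best["description"]
    return stats
```

Calling `summarize(load_rows(SAMPLE_TSV))` returns a plain dict, so the same structure could feed either a text report or `json.dumps`.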
Hi @karpathy @svlandeg, this PR adds a CLI version of `analysis.ipynb`. Currently the agent logs to `results.tsv` but has no programmatic way to analyze the results. No changes to existing files, no new dependencies. Happy to adjust based on feedback.
Covers load_results, compute_stats, trajectory states, edge cases, text report, and save_plot. Uses stdlib unittest only — no new deps. Credit to @MohammadWasi (karpathy#495) for suggesting test coverage.
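For readers wondering what a trajectory-state test might look like, here is a minimal `unittest` sketch. The `classify_trajectory` helper, its window size, and its rules for `improving`, `plateauing`, and `stuck` are hypothetical stand-ins, not the logic or the tests in this PR.

```python
import unittest


def classify_trajectory(statuses, window=5):
    """Hypothetical rule based on the most recent `window` experiment statuses.

    - "improving": at least one recent experiment was kept
    - "plateauing": none kept, but at least one ran to completion (discarded)
    - "stuck": every recent experiment crashed, or there is no history
    """
    recent = statuses[-window:]
    if "keep" in recent:
        return "improving"
    if "discard" in recent:
        return "plateauing"
    return "stuck"


class TrajectoryTests(unittest.TestCase):
    def test_improving_when_recent_keep(self):
        self.assertEqual(classify_trajectory(["discard", "keep"]), "improving")

    def test_plateauing_when_only_discards(self):
        # The single keep falls outside the five-item window.
        self.assertEqual(
            classify_trajectory(["keep"] + ["discard"] * 5), "plateauing"
        )

    def test_stuck_on_empty_or_all_crashes(self):
        self.assertEqual(classify_trajectory([]), "stuck")
        self.assertEqual(classify_trajectory(["crash"] * 5), "stuck")


def run_tests():
    """Run the suite programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TrajectoryTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The suite can be run with `run_tests()` or discovered by `python -m unittest` when saved in a test module.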
Hi @svlandeg, gentle nudge on this one. Following your note on #495 that this PR is the canonical one, I've pushed a follow-up commit (bb7b291) adding unit tests.
Credit to @MohammadWasi for suggesting test coverage in #495. Happy to adjust scope, style, or split tests out if you'd prefer a different layout. Let me know if there's anything else blocking on my end.
Summary
- Adds `analysis.py`, a CLI version of `analysis.ipynb` that the autonomous agent can call between experiments for structured feedback
- Supports a text report, JSON output (`--json`), and a progress chart (`--plot`)

Closes #476
Motivation
The agent logs results to `results.tsv` but has no programmatic way to analyze them; currently it has to manually grep and reason about raw TSV data. This script lets the agent run `analysis.py` between experiments and get a structured summary of what's working, what's not, and whether progress is plateauing, directly informing its next experiment choice.
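To make the agent-side use concrete, here is a hedged sketch of how a caller might invoke the script and turn the JSON summary into a next-step hint. The `uv run analysis.py --json` command and the `trajectory` values come from this PR; the `fetch_report` and `next_step_hint` helpers, the assumption that the JSON is printed to stdout, and the hint wording are hypothetical.

```python
import json
import subprocess


def fetch_report():
    """Run the CLI and parse its JSON output (assumes JSON is printed to stdout)."""
    proc = subprocess.run(
        ["uv", "run", "analysis.py", "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout)


def next_step_hint(report):
    """Map the trajectory indicator to a coarse, illustrative suggestion."""
    trajectory = report.get("trajectory")
    if trajectory == "improving":
        return "keep refining the current direction"
    if trajectory == "plateauing":
        return "try a different kind of change"
    if trajectory == "stuck":
        return "revisit the best kept experiment before continuing"
    return "no trajectory available"
```

Keeping the subprocess call separate from the decision logic means `next_step_hint` can be exercised with a hand-written dict, without running any experiments.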
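Usage
- `uv run analysis.py`: text report
- `uv run analysis.py --json`: machine-readable JSON
- `uv run analysis.py --plot progress.png`: save progress chart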
Example text output
Example JSON output
    {
      "total_experiments": 7,
      "kept": 4,
      "discarded": 2,
      "crashed": 1,
      "keep_rate": 0.6667,
      "baseline_bpb": 0.9979,
      "best_bpb": 0.9885,
      "improvement": 0.0094,
      "improvement_pct": 0.94,
      "best_experiment": "increase batch size to 2**20",
      "top_hits": [...],
      "trajectory": "improving"
    }

Test plan
- `--json` output parses as valid JSON
- `--plot` generates PNG