feat: CLI analysis tool for experiment results #495
Replace tiktoken decode/encode approach with direct mergeable_ranks lookup to avoid UTF-8 replacement character inflation in evaluation metrics. The old method could inflate BPB scores when BPE tokens contained invalid UTF-8 sequences, as tiktoken.decode() replaces them with U+FFFD (3 bytes) instead of the actual raw byte length (often 1 byte). Fixes karpathy#384
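The inflation described in this commit can be reproduced without tiktoken. The sketch below is illustrative only: `TOY_MERGEABLE_RANKS` is an invented stand-in that assumes the same bytes-to-rank shape as a BPE `mergeable_ranks` table, and the helper names are hypothetical, not taken from the PR.

```python
# Toy stand-in for a BPE vocabulary: raw token bytes -> rank (token id).
# Assumption: same shape as a mergeable_ranks mapping; entries are invented.
TOY_MERGEABLE_RANKS = {
    b"hello": 0,
    b" world": 1,
    b"\xe2\x82": 2,  # first two bytes of the euro sign, invalid UTF-8 alone
    b"\xac": 3,      # last byte of the euro sign, also invalid alone
}


def build_token_byte_lengths(mergeable_ranks):
    """Map each token id to the length of its raw byte sequence."""
    return {rank: len(raw) for raw, rank in mergeable_ranks.items()}


def roundtrip_byte_length(token_bytes):
    """Byte count after a lossy decode/encode round trip (the old approach)."""
    text = token_bytes.decode("utf-8", errors="replace")
    return len(text.encode("utf-8"))


ID_TO_BYTES = {rank: raw for raw, rank in TOY_MERGEABLE_RANKS.items()}
TOKEN_BYTE_LENGTHS = build_token_byte_lengths(TOY_MERGEABLE_RANKS)

# Columns: token id, raw bytes, direct length, round-trip length.
for token_id, raw in sorted(ID_TO_BYTES.items()):
    print(token_id, raw, TOKEN_BYTE_LENGTHS[token_id], roundtrip_byte_length(raw))
```

For valid UTF-8 tokens both counts agree. For the lone continuation byte `b"\xac"`, the direct lookup reports 1 byte while the round trip reports 3, because the byte is replaced by U+FFFD before re-encoding. A bits-per-byte metric divides by the byte count, so the choice of count changes the score.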
Explicitly define py-modules in pyproject.toml to resolve setuptools discovery issue that prevents editable installs. This fixes the 'Multiple top-level modules' error when running 'pip install -e .' or 'uv pip install -e .' Fixes karpathy#387
Add analysis.py CLI tool that provides structured feedback for autonomous agents during long experiment runs.

Features include:
- Text and JSON output formats for human and agent consumption
- Progress trajectory analysis (improving/plateauing/stuck)
- Experiment statistics and improvement tracking
- Progress plot generation with matplotlib
- Comprehensive test suite with 20 test cases

Usage:
    uv run analysis.py                       # text report
    uv run analysis.py --json                # JSON for agents
    uv run analysis.py --plot progress.png   # visualization
    uv run analysis.py --tsv custom.tsv      # custom results file

Fixes karpathy#476
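One possible way to derive the improving/plateauing/stuck label is sketched below. This is not the PR's implementation: the function names, the window size, the `min_delta` threshold, and the assumption that the metric is lower-is-better are all hypothetical choices made for illustration.

```python
import json


def classify_trajectory(scores, window=5, min_delta=1e-3):
    """Classify recent progress for a lower-is-better metric.

    Compares the best score before the last `window` experiments with the
    best score inside that window:
      - "improving":  the window beat the earlier best by more than min_delta
      - "plateauing": the window beat it, but by min_delta or less
      - "stuck":      the window did not beat the earlier best at all
    """
    if len(scores) <= window:
        return "insufficient_data"
    best_before = min(scores[:-window])
    best_recent = min(scores[-window:])
    gain = best_before - best_recent
    if gain > min_delta:
        return "improving"
    if gain > 0:
        return "plateauing"
    return "stuck"


def summarize(scores, window=5, min_delta=1e-3):
    """Build a JSON-serializable summary that an agent could parse."""
    return {
        "num_experiments": len(scores),
        "best_score": min(scores) if scores else None,
        "trajectory": classify_trajectory(scores, window, min_delta),
    }


if __name__ == "__main__":
    # Made-up history: early gains, then no new best in the last five runs.
    history = [1.20, 1.15, 1.12, 1.10, 1.10, 1.11, 1.10, 1.12, 1.10, 1.11]
    print(json.dumps(summarize(history), indent=2))
```

Comparing against the best earlier score, rather than the previous run, keeps a single noisy experiment from flipping the label. An agent reading the JSON form can branch on the `trajectory` field without parsing prose.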
Hi @MohammadWasi — heads-up that this PR implements the same feature as #475, which I opened on Apr 3 (4 days before this one) and which closes the same issue (#476) that I authored. The two PRs are functionally equivalent:
The only meaningful difference I can spot is that you added a
If you'd like to collaborate, please drop a comment on #475. I'm happy to fold any improvements from your version (e.g. the unittest suite, if @karpathy prefers that style) into my PR, and happy to give credit for any significant changes. That way the maintainer reviews one PR instead of two parallel implementations of the same feature.
Thanks @svlandeg and @MohammadWasi, appreciate it 🙏
Covers load_results, compute_stats, trajectory states, edge cases, text report, and save_plot. Uses stdlib unittest only — no new deps. Credit to @MohammadWasi (karpathy#495) for suggesting test coverage.
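A stdlib-only test in this style might look like the sketch below. The `load_results` shown here is a hypothetical stand-in written for the example: its signature, the `val_bpb` column name, and the file layout are assumptions, not the PR's actual code.

```python
import csv
import os
import tempfile
import unittest


def load_results(path):
    """Hypothetical loader: read a tab-separated results file into dicts."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    for row in rows:
        row["val_bpb"] = float(row["val_bpb"])  # assumed metric column
    return rows


class LoadResultsTest(unittest.TestCase):
    def _write(self, text):
        """Write text to a temporary .tsv file and schedule its removal."""
        tmp = tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False)
        self.addCleanup(os.remove, tmp.name)
        with tmp:
            tmp.write(text)
        return tmp.name

    def test_parses_rows(self):
        path = self._write("commit\tval_bpb\tstatus\nabc123\t1.05\tkeep\n")
        rows = load_results(path)
        self.assertEqual(len(rows), 1)
        self.assertAlmostEqual(rows[0]["val_bpb"], 1.05)

    def test_header_only_file_is_empty(self):
        path = self._write("commit\tval_bpb\tstatus\n")
        self.assertEqual(load_results(path), [])
```

Saved as a test module, this runs with `python -m unittest` and needs no dependencies beyond the standard library.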
Problem
During long autonomous research sessions (50-100+ experiments), the AI agent has no programmatic way to analyze experiment results from results.tsv. This creates a significant bottleneck where agents:
Solution
Implemented a comprehensive CLI analysis tool (analysis.py) that provides structured feedback for both humans and autonomous agents:
Features
Usage