
feat: CLI analysis tool for experiment results #495

Closed
MohammadWasi wants to merge 3 commits into karpathy:master from MohammadWasi:feat/cli-analysis-tool

Conversation


MohammadWasi commented Apr 7, 2026

Problem

During long autonomous research sessions (50-100+ experiments), the AI agent has no programmatic way to analyze experiment results from results.tsv. This creates a significant bottleneck where agents:

  1. Must manually parse raw tab-separated data
  2. Cannot track progress trends effectively
  3. Waste experiments retrying approaches in local minima
  4. Lack structured feedback for decision-making

Solution

This PR adds analysis.py, a CLI analysis tool that gives structured feedback on results.tsv to both humans and autonomous agents:

Features

  • Multiple output formats: Text for humans, JSON for agents
  • Progress trajectory analysis: Detects if experiments are improving/plateauing/stuck
  • Comprehensive statistics: Keep rates, improvements, top hits
  • Visualization: Progress plots with matplotlib
  • Flexible input: Custom TSV file paths
  • Agent-ready: JSON output with trajectory insights
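The trajectory states named above (improving / plateauing / stuck) could be derived roughly as follows. This is a minimal sketch, not the PR's implementation: the function name `classify_trajectory`, the `window` and `tolerance` parameters, their default values, and the assumption that the tracked metric is lower-is-better are all hypothetical.

```python
from typing import List


def classify_trajectory(scores: List[float], window: int = 5,
                        tolerance: float = 0.001) -> str:
    """Classify recent progress for a lower-is-better metric.

    Compares the best score inside the last `window` experiments with the
    best score seen before that window. Thresholds are illustrative only.
    """
    if len(scores) <= window:
        # Too little history to judge; optimistic default (an assumption).
        return "improving"

    best_before = min(scores[:-window])
    best_recent = min(scores[-window:])
    gap = best_recent - best_before  # negative means the recent runs are better

    if gap <= -tolerance:
        return "improving"    # recent runs beat the earlier best
    if gap <= tolerance:
        return "plateauing"   # recent runs roughly match the earlier best
    return "stuck"            # recent runs are clearly worse than the earlier best


print(classify_trajectory([1.00, 0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93], window=3))
print(classify_trajectory([1.00, 0.95, 0.9501, 0.9502, 0.9499], window=3))
print(classify_trajectory([1.00, 0.95, 0.97, 0.98, 0.99], window=3))
```

An agent reading a "stuck" label could use it as a cue to abandon the current line of experiments rather than retrying nearby variants.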

Usage

uv run analysis.py                      # Human-readable report
uv run analysis.py --json               # Machine-readable for agents
uv run analysis.py --plot progress.png  # Save visualization
uv run analysis.py --tsv custom.tsv     # Custom results file
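For orientation, the three flags shown above could be wired up with `argparse` along these lines. This is a sketch under assumptions: the flag names come from the usage lines, but the `build_parser` helper, the help strings, and the `results.tsv` default for `--tsv` are illustrative and may not match the PR's code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the --json, --plot and --tsv flags."""
    parser = argparse.ArgumentParser(
        prog="analysis.py",
        description="Summarize experiment results from a TSV file.",
    )
    # Default path is an assumption based on the results.tsv name in the PR text.
    parser.add_argument("--tsv", default="results.tsv",
                        help="path to the results file")
    parser.add_argument("--json", action="store_true",
                        help="emit a JSON report instead of text")
    parser.add_argument("--plot", metavar="PATH", default=None,
                        help="save a progress plot to PATH")
    return parser


args = build_parser().parse_args(["--json", "--plot", "progress.png"])
print(args.json, args.plot, args.tsv)  # True progress.png results.tsv
```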

Replace tiktoken decode/encode approach with direct
mergeable_ranks lookup to avoid UTF-8 replacement
character inflation in evaluation metrics.

The old method could inflate BPB scores when BPE tokens
contained invalid UTF-8 sequences, as tiktoken.decode()
replaces them with U+FFFD (3 bytes) instead of the
actual raw byte length (often 1 byte).

Fixes karpathy#384

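The byte-count discrepancy described in this commit message can be reproduced with the standard library alone. The sketch below does not use tiktoken; the small `mergeable_ranks` dictionary is a hypothetical stand-in for a real bytes-to-rank table, and the helper names are illustrative.

```python
# Hypothetical stand-in for a BPE table mapping raw token bytes -> rank.
mergeable_ranks = {b"hello": 0, b"\xe2": 1}

# Invert it so a token id (rank) maps back to its raw bytes.
token_bytes = {rank: raw for raw, rank in mergeable_ranks.items()}


def bytes_via_decode(token_id: int) -> int:
    """Byte length after a decode/encode round trip with replacement."""
    text = token_bytes[token_id].decode("utf-8", errors="replace")
    return len(text.encode("utf-8"))


def bytes_via_lookup(token_id: int) -> int:
    """Byte length taken directly from the raw token bytes."""
    return len(token_bytes[token_id])


# Valid UTF-8 token: both methods agree.
print(bytes_via_decode(0), bytes_via_lookup(0))  # 5 5
# Lone lead byte 0xE2 is invalid on its own: it decodes to U+FFFD,
# which re-encodes to 3 bytes, while the raw token is 1 byte.
print(bytes_via_decode(1), bytes_via_lookup(1))  # 3 1
```

Any per-byte metric that takes its byte counts from the round trip would therefore miscount tokens like the second one, which is the case the direct lookup avoids.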
Explicitly define py-modules in pyproject.toml to resolve
setuptools discovery issue that prevents editable installs.

This fixes the 'Multiple top-level modules' error when running
'pip install -e .' or 'uv pip install -e .'

Fixes karpathy#387

Add analysis.py CLI tool that provides structured feedback for autonomous
agents during long experiment runs. Features include:

- Text and JSON output formats for human and agent consumption
- Progress trajectory analysis (improving/plateauing/stuck)
- Experiment statistics and improvement tracking
- Progress plot generation with matplotlib
- Comprehensive test suite with 20 test cases

Usage:
  uv run analysis.py                      # text report
  uv run analysis.py --json               # JSON for agents
  uv run analysis.py --plot progress.png  # visualization
  uv run analysis.py --tsv custom.tsv     # custom results file

Fixes karpathy#476
MohammadWasi changed the title from "Feat/cli analysis tool" to "feat: CLI analysis tool for experiment results" on Apr 7, 2026
MohammadWasi marked this pull request as ready for review on April 7, 2026 02:31

ravyg commented Apr 8, 2026

Hi @MohammadWasi — heads-up that this PR implements the same feature as #475, which I opened on Apr 3 (4 days before this one) and which closes the same issue (#476) that I authored.

The two PRs are functionally equivalent:

  • Same analysis.py filename
  • Same CLI flags: --json, --plot, --tsv
  • Same trajectory states: improving / plateauing / stuck
  • Same JSON output shape and same text report structure
  • Same set of computed stats: experiment counts, baseline vs best, top hits, keep rate
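The shared set of statistics listed above (experiment counts, baseline vs best, top hits, keep rate) might be computed along these lines. This is only a sketch: the `compute_stats` name appears later in this thread, but its signature here, the `val_bpb` and `status` column names, the `"keep"` status value, and the lower-is-better convention are assumptions, and the sample rows are made up for illustration.

```python
import json
from typing import Dict, List


def compute_stats(rows: List[Dict[str, str]], top_n: int = 3) -> Dict[str, object]:
    """Summarize experiment rows (assumed columns: val_bpb, status)."""
    if not rows:
        raise ValueError("no experiments to summarize")

    scores = [float(r["val_bpb"]) for r in rows]
    kept = [r for r in rows if r["status"] == "keep"]
    baseline = scores[0]   # first experiment is treated as the baseline
    best = min(scores)     # lower is assumed to be better

    return {
        "experiments": len(rows),
        "kept": len(kept),
        "keep_rate": len(kept) / len(rows),
        "baseline": baseline,
        "best": best,
        "improvement": baseline - best,
        # Kept experiments with the lowest scores, best first.
        "top_hits": sorted(kept, key=lambda r: float(r["val_bpb"]))[:top_n],
    }


# Illustrative rows, not real results.
rows = [
    {"val_bpb": "1.00", "status": "keep"},
    {"val_bpb": "1.02", "status": "discard"},
    {"val_bpb": "0.95", "status": "keep"},
    {"val_bpb": "0.97", "status": "discard"},
]
stats = compute_stats(rows)
print(json.dumps(stats, indent=2))
```

Returning a plain dictionary keeps the same data usable for both the text report and the JSON output.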

The only meaningful difference I can spot is that you added a test_analysis.py file. Other deltas are stylistic (slightly different function decomposition, different magic numbers in the trajectory thresholds).

If you'd like to collaborate, please drop a comment on #475 — I'm happy to fold any improvements from your version (e.g. the unittest suite, if @karpathy prefers that style) into my PR and happy to add/give credits for any significant changes. That way the maintainer reviews one PR instead of two parallel implementations of the same feature.

svlandeg (Collaborator) left a comment


Agreed, thanks @ravyg. Closing as duplicate.

svlandeg closed this on Apr 8, 2026

ravyg commented Apr 8, 2026

Thanks @svlandeg and @MohammadWasi, appreciate it 🙏

ravyg added a commit to ravyg/autoresearch that referenced this pull request Apr 14, 2026
Covers load_results, compute_stats, trajectory states, edge cases,
text report, and save_plot. Uses stdlib unittest only — no new deps.

Credit to @MohammadWasi (karpathy#495) for suggesting test coverage.
3 participants