docs: guide for evaluating experiment results at scale#303
Open
dean0x wants to merge 2 commits into karpathy:master from
Conversation
IgorTavcar added a commit to IgorTavcar/autoresearch that referenced this pull request on Mar 17, 2026
…context mgmt, low-VRAM, eval guide

PR karpathy#291 — Data integrity verification for downloads
Adds Content-Length size verification and Parquet metadata validation (pq.read_metadata) before committing downloaded shards. Catches truncated or corrupted files from network interruptions before they get sealed with a SHA-256 hash. Layered on top of our existing atomic .tmp rename and SHA-256 sidecar verification.

PR karpathy#282 — Bake reflection into the experiment loop
Adds musings.md initialization to setup, plus pre-experiment rationale (step 2: explain the idea and its ML grounding) and post-experiment reflection (step 9: record outcome and interpretation). Leaves a learning trail for humans and may improve agent idea-generation quality.

Issue karpathy#298 — Subagent delegation for context window preservation
Adds a "Context management" section to program.md with a subagent prompt template. The main agent holds research state; subagents handle mechanical steps (commit, train, extract metrics). Verbose output dies with the subagent, keeping the primary context clean over 50+ experiment runs.

PR karpathy#299 — Low-VRAM auto-detection (cherry-picked universal parts)
Adds VRAM detection: GPUs with < 6 GB automatically get reduced hyperparameters (batch=32, seq=256, depth=4, SSSL window pattern). Introduces a TRAIN_SEQ_LEN variable used throughout model config, dataloader, and evaluation. Also adds optional seq_len and max_steps parameters to evaluate_bpb() for flexible eval on constrained hardware. Skipped: hardware-specific torch/kernel downgrades and 1050 Ti tuning.

PR karpathy#303 — Guide for evaluating experiment results at scale
New docs/evaluating-results.md covering noise floor estimation (an awk one-liner for the median pairwise delta), when to trust an improvement (the 1.5x noise floor rule), Pareto efficiency analysis, and useful one-liners for results.tsv at scale.

Optional: PR karpathy#276 — Deterministic keep/discard policy engine
Standalone contrib/policy_engine.py (60 lines) plus a test suite (9 tests). Evaluates experiments by val_bpb improvement vs. complexity tradeoff. NOT wired into the training loop — available as an optional decision aid. Placed in contrib/ to signal its optional nature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
elementalcollision pushed a commit to elementalcollision/autoresearch that referenced this pull request on Mar 17, 2026
- run_suite.py: orchestrates full dataset sweep with profile management, per-dataset data/tokenizer isolation, and sequential agent runs
- compare_datasets.py: cross-dataset analysis with convergence curves, Pareto frontier charts, and optimal hyperparameter comparison
- docs/evaluating-results.md: adapted from karpathy/autoresearch PR karpathy#303 (Dean Sharon) for noise floor estimation and results-at-scale analysis
- convert_dataset.py: add cosmopedia-v2, slimpajama, python-edu datasets
- results/fineweb-edu/: 88-experiment agent-run results (best 1.3424)
- .gitignore: track results/<dataset>/results.tsv, ignore root results.tsv

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
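The Pareto frontier charts mentioned above reduce to a simple dominance check over runs. A minimal sketch, assuming each run boils down to a `(cost, val_bpb)` pair with both lower-is-better; the tuple schema is illustrative, not the actual results.tsv format:

```python
# Keep only runs not dominated on (cost, val_bpb): a run is on the
# frontier if no other run is both cheaper and better.
def pareto_frontier(runs):
    """runs: list of (cost, val_bpb); returns the non-dominated subset."""
    frontier = []
    best_bpb = float("inf")
    for cost, bpb in sorted(runs):      # scan in ascending cost order
        if bpb < best_bpb:              # strictly better than every cheaper run
            frontier.append((cost, bpb))
            best_bpb = bpb
    return frontier
```

For example, `pareto_frontier([(1, 1.40), (2, 1.35), (3, 1.36), (4, 1.30)])` drops the `(3, 1.36)` run, since `(2, 1.35)` is both cheaper and better.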
elementalcollision pushed a commit to elementalcollision/autoresearch that referenced this pull request on Mar 18, 2026
Adds run_suite.py for orchestrating experiments across datasets, compare_datasets.py for cross-dataset analysis and visualization, the evaluating-results guide (adapted from karpathy/autoresearch PR karpathy#303), FineWeb-Edu results, and a --tag=value argument-parsing fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After running about 100 experiments I kept squinting at results.tsv trying to figure out which improvements were real. Wrote up what I learned about noise floor estimation and Pareto efficiency for the autoresearch metric format.
This doesn't add any tooling or change any code. Just a standalone guide in docs/ for people who have accumulated enough experiments that eyeballing val_bpb deltas stops working.
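For concreteness, the noise-floor idea can be sketched in a few lines of Python (the guide itself uses an awk one-liner over results.tsv). The 1.5x factor follows the rule described in the guide; the function names and sample numbers below are made up for illustration:

```python
# Estimate run-to-run noise from repeated runs of the SAME config, then only
# trust improvements larger than ~1.5x that floor. Names are illustrative.
import itertools
import statistics


def noise_floor(repeat_bpbs):
    """Median absolute pairwise delta across repeats of one config."""
    deltas = [abs(a - b) for a, b in itertools.combinations(repeat_bpbs, 2)]
    return statistics.median(deltas)


def is_real_improvement(baseline_bpb, candidate_bpb, floor, factor=1.5):
    """True if the val_bpb drop clears the scaled noise floor."""
    return (baseline_bpb - candidate_bpb) > factor * floor
```

With three repeats at, say, 1.3450, 1.3441, and 1.3458, the floor is the median of the three pairwise deltas (about 0.0009), so only improvements beyond roughly 0.0014 bpb would count as real under the 1.5x rule.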
Happy to adjust scope or placement if you'd prefer this somewhere else.