docs: guide for evaluating experiment results at scale#303
Open
dean0x wants to merge 2 commits into karpathy:master from
Conversation
IgorTavcar added a commit to IgorTavcar/autoresearch that referenced this pull request on Mar 17, 2026
…context mgmt, low-VRAM, eval guide

PR karpathy#291 — Data integrity verification for downloads
Adds Content-Length size verification and Parquet metadata validation (pq.read_metadata) before committing downloaded shards. Catches truncated or corrupted files from network interruptions before they get sealed with a SHA-256 hash. Layered on top of our existing atomic .tmp rename and SHA-256 sidecar verification.

PR karpathy#282 — Bake reflection into the experiment loop
Adds musings.md initialization to setup, plus pre-experiment rationale (step 2: explain the idea and its ML grounding) and post-experiment reflection (step 9: record outcome and interpretation). Leaves a learning trail for humans and may improve agent idea-generation quality.

Issue karpathy#298 — Subagent delegation for context window preservation
Adds a "Context management" section to program.md with a subagent prompt template. The main agent holds research state; subagents handle mechanical steps (commit, train, extract metrics). Verbose output dies with the subagent, keeping the primary context clean over 50+ experiment runs.

PR karpathy#299 — Low-VRAM auto-detection (cherry-picked universal parts)
Adds VRAM detection: GPUs with < 6 GB automatically get reduced hyperparameters (batch=32, seq=256, depth=4, SSSL window pattern). Introduces a TRAIN_SEQ_LEN variable used throughout model config, dataloader, and evaluation. Also adds optional seq_len and max_steps parameters to evaluate_bpb() for flexible eval on constrained hardware. Skipped: hardware-specific torch/kernel downgrades and 1050 Ti tuning.

PR karpathy#303 — Guide for evaluating experiment results at scale
New docs/evaluating-results.md covering noise floor estimation (an awk one-liner for the median pairwise delta), when to trust an improvement (the 1.5x noise floor rule), Pareto efficiency analysis, and useful one-liners for results.tsv at scale.

Optional: PR karpathy#276 — Deterministic keep/discard policy engine
Standalone contrib/policy_engine.py (60 lines) plus a test suite (9 tests). Evaluates experiments by val_bpb improvement vs. complexity tradeoff. NOT wired into the training loop — available as an optional decision aid. Placed in contrib/ to signal its optional nature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
elementalcollision pushed a commit to elementalcollision/autoresearch that referenced this pull request on Mar 17, 2026
- run_suite.py: orchestrates full dataset sweep with profile management, per-dataset data/tokenizer isolation, and sequential agent runs
- compare_datasets.py: cross-dataset analysis with convergence curves, Pareto frontier charts, and optimal hyperparameter comparison
- docs/evaluating-results.md: adapted from karpathy/autoresearch PR karpathy#303 (Dean Sharon) for noise floor estimation and results-at-scale analysis
- convert_dataset.py: add cosmopedia-v2, slimpajama, python-edu datasets
- results/fineweb-edu/: 88-experiment agent-run results (best 1.3424)
- .gitignore: track results/<dataset>/results.tsv, ignore root results.tsv

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
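The Pareto frontier charts mentioned above reduce to a simple dominance check over runs. A minimal sketch, assuming each run boils down to a `(cost, val_bpb)` pair with both lower-is-better; the tuple schema is illustrative, not the actual results.tsv format:

```python
# Keep only runs not dominated on (cost, val_bpb): a run is on the
# frontier if no other run is both cheaper and better.
def pareto_frontier(runs):
    """runs: list of (cost, val_bpb); returns the non-dominated subset."""
    frontier = []
    best_bpb = float("inf")
    for cost, bpb in sorted(runs):      # scan in ascending cost order
        if bpb < best_bpb:              # strictly better than every cheaper run
            frontier.append((cost, bpb))
            best_bpb = bpb
    return frontier
```

For example, `pareto_frontier([(1, 1.40), (2, 1.35), (3, 1.36), (4, 1.30)])` drops the `(3, 1.36)` run, since `(2, 1.35)` is both cheaper and better.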
elementalcollision pushed a commit to elementalcollision/autoresearch that referenced this pull request on Mar 18, 2026
Adds run_suite.py for orchestrating experiments across datasets, compare_datasets.py for cross-dataset analysis and visualization, the evaluating-results guide (adapted from karpathy/autoresearch PR karpathy#303), FineWeb-Edu results, and a --tag=value argument-parsing fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After running about 100 experiments I kept squinting at results.tsv trying to figure out which improvements were real. Wrote up what I learned about noise floor estimation and Pareto efficiency for the autoresearch metric format.
This doesn't add any tooling or change any code. Just a standalone guide in docs/ for people who have accumulated enough experiments that eyeballing val_bpb deltas stops working.
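For concreteness, the noise-floor idea can be sketched in a few lines of Python (the guide itself uses an awk one-liner over results.tsv). The 1.5x factor follows the rule described in the guide; the function names and sample numbers below are made up for illustration:

```python
# Estimate run-to-run noise from repeated runs of the SAME config, then only
# trust improvements larger than ~1.5x that floor. Names are illustrative.
import itertools
import statistics


def noise_floor(repeat_bpbs):
    """Median absolute pairwise delta across repeats of one config."""
    deltas = [abs(a - b) for a, b in itertools.combinations(repeat_bpbs, 2)]
    return statistics.median(deltas)


def is_real_improvement(baseline_bpb, candidate_bpb, floor, factor=1.5):
    """True if the val_bpb drop clears the scaled noise floor."""
    return (baseline_bpb - candidate_bpb) > factor * floor
```

With three repeats at, say, 1.3450, 1.3441, and 1.3458, the floor is the median of the three pairwise deltas (about 0.0009), so only improvements beyond roughly 0.0014 bpb would count as real under the 1.5x rule.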
Happy to adjust scope or placement if you'd prefer this somewhere else.