feat(autoresearch): Implement keep/discard policy engine by nmandal · Pull Request #276 · karpathy/autoresearch

nmandal · 2026-03-15T04:47:12Z

This PR implements a deterministic policy engine that decides whether to keep or discard an autoresearch result, based on validation bits-per-byte (val_bpb) and a complexity score. Candidates with lower val_bpb are prioritized, and simpler candidates are favored. Crash and timeout statuses are handled explicitly. A unit test suite validates the decision logic under various conditions. Resolves NIC-320.
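To make the decision rule concrete, here is a minimal sketch of how such a keep/discard check might look. It is not the code from this PR: the `Result` fields, the `decide` function, and the `complexity_weight` parameter are hypothetical names chosen for illustration, and folding complexity into a single score is only one possible way to express the val_bpb vs. complexity tradeoff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    """One experiment outcome (hypothetical field names)."""
    status: str                # "ok", "crash", or "timeout"
    val_bpb: Optional[float]   # validation bits-per-byte; lower is better
    complexity: float          # complexity score; lower is simpler


def score(result: Result, complexity_weight: float) -> float:
    """Combined score: val_bpb plus a penalty proportional to complexity."""
    return result.val_bpb + complexity_weight * result.complexity


def decide(candidate: Result, baseline: Result,
           complexity_weight: float = 0.001) -> str:
    """Return "keep" or "discard" for a candidate relative to a baseline.

    Assumes the baseline is a successful run with a val_bpb value.
    """
    # Crashed or timed-out runs are discarded regardless of any metric.
    if candidate.status in ("crash", "timeout") or candidate.val_bpb is None:
        return "discard"
    # Keep only on a strict improvement of the combined score, so the
    # same inputs always produce the same decision.
    if score(candidate, complexity_weight) < score(baseline, complexity_weight):
        return "keep"
    return "discard"
```

With this sketch, a candidate that matches the baseline's val_bpb but has a lower complexity score is kept, while a small val_bpb gain that comes with a large complexity increase is discarded. The weight that sets this exchange rate is an assumption and would need tuning.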

…context mgmt, low-VRAM, eval guide

PR karpathy#291: Data integrity verification for downloads
Adds Content-Length size verification and Parquet metadata validation (pq.read_metadata) before committing downloaded shards. Catches truncated or corrupted files from network interruptions before they get sealed with a SHA-256 hash. Layered on top of our existing atomic .tmp rename and SHA-256 sidecar verification.

PR karpathy#282: Bake reflection into the experiment loop
Adds musings.md initialization to setup, plus pre-experiment rationale (step 2: explain the idea and its ML grounding) and post-experiment reflection (step 9: record outcome and interpretation). Leaves a learning trail for humans and may improve agent idea generation quality.

Issue karpathy#298: Subagent delegation for context window preservation
Adds a "Context management" section to program.md with a subagent prompt template. The main agent holds research state; subagents handle mechanical steps (commit, train, extract metrics). Verbose output dies with the subagent, keeping the primary context clean over 50+ experiment runs.

PR karpathy#299: Low-VRAM auto-detection (cherry-picked universal parts)
Adds VRAM detection: GPUs with < 6GB automatically get reduced hyperparameters (batch=32, seq=256, depth=4, SSSL window pattern). Introduces TRAIN_SEQ_LEN variable used throughout model config, dataloader, and evaluation. Also adds seq_len and max_steps optional parameters to evaluate_bpb() for flexible eval on constrained hardware. Skipped: hardware-specific torch/kernels downgrades, 1050 Ti tuning.

PR karpathy#303: Guide for evaluating experiment results at scale
New docs/evaluating-results.md covering noise floor estimation (awk one-liner for median pairwise delta), when to trust an improvement (1.5x noise floor rule), Pareto efficiency analysis, and useful one-liners for results.tsv at scale.

Optional: PR karpathy#276: Deterministic keep/discard policy engine
Standalone contrib/policy_engine.py (60 lines) + test suite (9 tests). Evaluates experiments by val_bpb improvement vs complexity tradeoff. NOT wired into the training loop; available as an optional decision aid. Placed in contrib/ to signal its optional nature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
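For reference, the noise-floor rule summarized in the commit message above could be sketched as follows. The guide itself reportedly uses an awk one-liner; this is a Python stand-in, not a port of it. The function names are hypothetical, and reading "median pairwise delta" as the median absolute difference over all pairs of repeated baseline runs is an assumption.

```python
from itertools import combinations
from statistics import median


def noise_floor(baseline_bpbs):
    """Median absolute val_bpb difference over all pairs of repeated runs.

    Assumption: "pairwise" means every unordered pair, not only
    consecutive runs.
    """
    deltas = [abs(a - b) for a, b in combinations(baseline_bpbs, 2)]
    if not deltas:
        raise ValueError("need at least two baseline runs")
    return median(deltas)


def is_trustworthy(improvement, floor, factor=1.5):
    """Apply the 1.5x rule: trust a gain only if it exceeds factor * floor."""
    return improvement > factor * floor


# Hypothetical val_bpb values from four reruns of the same configuration.
runs = [1.000, 1.002, 0.999, 1.001]
floor = noise_floor(runs)
```

Under these made-up numbers the floor is about 0.0015 bpb, so an improvement would need to exceed roughly 0.00225 bpb before it is treated as real.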

nmandal wants to merge 1 commit into karpathy:master from nmandal:feature/nic-320-policy-engine
