
Fix: add data integrity verification to downloader to prevent corrupted shards #291

Open

GhostDragon124 wants to merge 1 commit into karpathy:master from GhostDragon124:master

Conversation

@GhostDragon124

The Problem: The current downloader treats a transfer as successful if the file merely exists on disk. However, a network interruption can leave a truncated .parquet file behind, causing pyarrow.lib.ArrowInvalid or UnicodeDecodeError during the preprocessing stage.

The Solution:

Implemented Atomic Renaming: Use .tmp files during download to prevent partial files from being treated as complete.

Added Integrity Validation: Use pyarrow.parquet.read_metadata to verify the file structure before finalizing the download.

Added Length Verification: Compare the Content-Length header with the actual local file size.

Impact: Enhances the robustness of the data pipeline, especially for users with unstable network connections.

IgorTavcar added a commit to IgorTavcar/autoresearch that referenced this pull request Mar 17, 2026
…context mgmt, low-VRAM, eval guide

PR karpathy#291 — Data integrity verification for downloads
  Adds Content-Length size verification and Parquet metadata validation
  (pq.read_metadata) before committing downloaded shards. Catches truncated
  or corrupted files from network interruptions before they get sealed with
  a SHA-256 hash. Layered on top of our existing atomic .tmp rename and
  SHA-256 sidecar verification.

PR karpathy#282 — Bake reflection into the experiment loop
  Adds musings.md initialization to setup, plus pre-experiment rationale
  (step 2: explain the idea and its ML grounding) and post-experiment
  reflection (step 9: record outcome and interpretation). Leaves a learning
  trail for humans and may improve agent idea generation quality.

Issue karpathy#298 — Subagent delegation for context window preservation
  Adds a "Context management" section to program.md with a subagent prompt
  template. The main agent holds research state; subagents handle mechanical
  steps (commit, train, extract metrics). Verbose output dies with the
  subagent, keeping the primary context clean over 50+ experiment runs.

PR karpathy#299 — Low-VRAM auto-detection (cherry-picked universal parts)
  Adds VRAM detection: GPUs with < 6GB automatically get reduced
  hyperparameters (batch=32, seq=256, depth=4, SSSL window pattern).
  Introduces TRAIN_SEQ_LEN variable used throughout model config,
  dataloader, and evaluation. Also adds seq_len and max_steps optional
  parameters to evaluate_bpb() for flexible eval on constrained hardware.
  Skipped: hardware-specific torch/kernels downgrades, 1050 Ti tuning.

PR karpathy#303 — Guide for evaluating experiment results at scale
  New docs/evaluating-results.md covering noise floor estimation (awk
  one-liner for median pairwise delta), when to trust an improvement
  (1.5x noise floor rule), Pareto efficiency analysis, and useful
  one-liners for results.tsv at scale.
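The noise-floor logic in the guide can be sketched in Python (equivalent in spirit to the awk one-liner; the function names and the exact delta definition are assumptions):

```python
from statistics import median

def noise_floor(repeated_bpb):
    """Median absolute pairwise delta among repeated runs of the same config."""
    deltas = [abs(a - b)
              for i, a in enumerate(repeated_bpb)
              for b in repeated_bpb[i + 1:]]
    return median(deltas)

def trust_improvement(delta_bpb, floor, factor=1.5):
    """The 1.5x-noise-floor rule: only trust deltas well above run-to-run noise."""
    return abs(delta_bpb) > factor * floor
```

For example, if three reruns of the baseline give val_bpb of 4.0, 4.1, and 3.9, the floor is about 0.1 bpb, so only improvements larger than roughly 0.15 bpb should be trusted.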

Optional: PR karpathy#276 — Deterministic keep/discard policy engine
  Standalone contrib/policy_engine.py (60 lines) + test suite (9 tests).
  Evaluates experiments by val_bpb improvement vs complexity tradeoff.
  NOT wired into the training loop — available as an optional decision
  aid. Placed in contrib/ to signal its optional nature.
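The improvement-vs-complexity tradeoff such an engine evaluates might look like the following. This is a hypothetical sketch, not the actual contrib/policy_engine.py; the decision rule, names, and thresholds are all assumptions:

```python
def decide(val_bpb_delta, lines_changed, *, min_gain=0.002, gain_per_100_lines=0.001):
    """Keep an experiment only if its bpb improvement clears a complexity-scaled bar.

    val_bpb_delta: baseline_bpb - experiment_bpb (positive means improvement).
    lines_changed: size of the diff, used as a crude complexity proxy.
    """
    # The bar rises with diff size: big changes must earn bigger gains.
    bar = min_gain + gain_per_100_lines * (lines_changed / 100)
    return "keep" if val_bpb_delta >= bar else "discard"
```

Keeping the rule deterministic (no randomness, no model calls) is what makes it usable as a reproducible decision aid rather than another source of noise.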

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
