
Fix: add data integrity verification to downloader to prevent corrupted shards #291

Open

GhostDragon124 wants to merge 1 commit into karpathy:master from GhostDragon124:master

Conversation

@GhostDragon124

The Problem: The current downloader treats a transfer as successful if the file merely exists on disk. However, a network interruption can leave a truncated .parquet file behind, causing pyarrow.lib.ArrowInvalid or UnicodeDecodeError during the preprocessing stage.

The Solution:

Implemented Atomic Renaming: Use .tmp files during download to prevent partial files from being treated as complete.

Added Integrity Validation: Use pyarrow.parquet.read_metadata to verify the file structure before finalizing the download.

Added Length Verification: Compare the Content-Length header with the actual local file size.

Impact: Enhances the robustness of the data pipeline, especially for users with unstable network connections.

IgorTavcar added a commit to IgorTavcar/autoresearch that referenced this pull request Mar 17, 2026
…context mgmt, low-VRAM, eval guide

PR karpathy#291 — Data integrity verification for downloads
  Adds Content-Length size verification and Parquet metadata validation
  (pq.read_metadata) before committing downloaded shards. Catches truncated
  or corrupted files from network interruptions before they get sealed with
  a SHA-256 hash. Layered on top of our existing atomic .tmp rename and
  SHA-256 sidecar verification.

PR karpathy#282 — Bake reflection into the experiment loop
  Adds musings.md initialization to setup, plus pre-experiment rationale
  (step 2: explain the idea and its ML grounding) and post-experiment
  reflection (step 9: record outcome and interpretation). Leaves a learning
  trail for humans and may improve agent idea generation quality.

Issue karpathy#298 — Subagent delegation for context window preservation
  Adds a "Context management" section to program.md with a subagent prompt
  template. The main agent holds research state; subagents handle mechanical
  steps (commit, train, extract metrics). Verbose output dies with the
  subagent, keeping the primary context clean over 50+ experiment runs.

PR karpathy#299 — Low-VRAM auto-detection (cherry-picked universal parts)
  Adds VRAM detection: GPUs with < 6GB automatically get reduced
  hyperparameters (batch=32, seq=256, depth=4, SSSL window pattern).
  Introduces TRAIN_SEQ_LEN variable used throughout model config,
  dataloader, and evaluation. Also adds seq_len and max_steps optional
  parameters to evaluate_bpb() for flexible eval on constrained hardware.
  Skipped: hardware-specific torch/kernels downgrades, 1050 Ti tuning.

PR karpathy#303 — Guide for evaluating experiment results at scale
  New docs/evaluating-results.md covering noise floor estimation (awk
  one-liner for median pairwise delta), when to trust an improvement
  (1.5x noise floor rule), Pareto efficiency analysis, and useful
  one-liners for results.tsv at scale.
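The noise-floor logic in the guide can be sketched in Python (equivalent in spirit to the awk one-liner; the function names and the exact delta definition are assumptions):

```python
from statistics import median

def noise_floor(repeated_bpb):
    """Median absolute pairwise delta among repeated runs of the same config."""
    deltas = [abs(a - b)
              for i, a in enumerate(repeated_bpb)
              for b in repeated_bpb[i + 1:]]
    return median(deltas)

def trust_improvement(delta_bpb, floor, factor=1.5):
    """The 1.5x-noise-floor rule: only trust deltas well above run-to-run noise."""
    return abs(delta_bpb) > factor * floor
```

For example, if three reruns of the baseline give val_bpb of 4.0, 4.1, and 3.9, the floor is about 0.1 bpb, so only improvements larger than roughly 0.15 bpb should be trusted.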

Optional: PR karpathy#276 — Deterministic keep/discard policy engine
  Standalone contrib/policy_engine.py (60 lines) + test suite (9 tests).
  Evaluates experiments by val_bpb improvement vs complexity tradeoff.
  NOT wired into the training loop — available as an optional decision
  aid. Placed in contrib/ to signal its optional nature.
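The improvement-vs-complexity tradeoff such an engine evaluates might look like the following. This is a hypothetical sketch, not the actual contrib/policy_engine.py; the decision rule, names, and thresholds are all assumptions:

```python
def decide(val_bpb_delta, lines_changed, *, min_gain=0.002, gain_per_100_lines=0.001):
    """Keep an experiment only if its bpb improvement clears a complexity-scaled bar.

    val_bpb_delta: baseline_bpb - experiment_bpb (positive means improvement).
    lines_changed: size of the diff, used as a crude complexity proxy.
    """
    # The bar rises with diff size: big changes must earn bigger gains.
    bar = min_gain + gain_per_100_lines * (lines_changed / 100)
    return "keep" if val_bpb_delta >= bar else "discard"
```

Keeping the rule deterministic (no randomness, no model calls) is what makes it usable as a reproducible decision aid rather than another source of noise.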

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
