fix(train): guard against zero grad_accum_steps #264
Open
jluk wants to merge 1 commit into karpathy:master from
Conversation
If TOTAL_BATCH_SIZE is smaller than a single forward pass (DEVICE_BATCH_SIZE * MAX_SEQ_LEN), grad_accum_steps silently becomes 0, causing the training loop to skip all gradient accumulation. Add an assertion to fail fast with a clear message instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
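For context, here is a minimal sketch of the failure mode and the guard described above. The variable names come from the PR description; the example values and the exact placement in train.py are assumptions (the real computation may also divide by the number of data-parallel ranks):

```python
# Illustrative values only; in train.py these come from the config / environment.
TOTAL_BATCH_SIZE = 524288    # total tokens per optimizer step
DEVICE_BATCH_SIZE = 32       # sequences per forward/backward pass on one device
MAX_SEQ_LEN = 2048           # tokens per sequence

tokens_per_fwdbwd = DEVICE_BATCH_SIZE * MAX_SEQ_LEN
grad_accum_steps = TOTAL_BATCH_SIZE // tokens_per_fwdbwd  # floor division -> 0 if the total batch is too small

# The guard this PR adds: fail fast with an actionable message instead of silently
# running a training loop that never accumulates gradients.
assert grad_accum_steps >= 1, (
    f"TOTAL_BATCH_SIZE ({TOTAL_BATCH_SIZE}) must be at least "
    f"DEVICE_BATCH_SIZE * MAX_SEQ_LEN ({tokens_per_fwdbwd}); "
    f"got grad_accum_steps={grad_accum_steps}."
)
```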
IgorTavcar added a commit to IgorTavcar/autoresearch that referenced this pull request on Mar 14, 2026:
…rity hardening

Cherry-picked improvements from open PRs on karpathy/autoresearch:

- PR karpathy#265 — Save checkpoint before eval to survive crashes. Applied to baseline/train.py (torch.save) and train_mlx.py (mx.save_safetensors). If eval OOMs or crashes, training work is preserved.
- PR karpathy#264 — Guard against zero grad_accum_steps. Added assertion after computing grad_accum_steps. Catches silent misconfiguration when TOTAL_BATCH_SIZE < DEVICE_BATCH_SIZE * MAX_SEQ_LEN.
- PR karpathy#188 — Add helpful error messages to bare asserts. Window pattern and batch size divisibility asserts now explain what went wrong instead of raising a bare AssertionError.
- PR karpathy#185 — Print startup_seconds in final summary. Useful diagnostic for measuring compilation/init overhead across backends.
- PR karpathy#138 — Make DEVICE_BATCH_SIZE configurable via env var. Avoids source code edits when switching between Apple Silicon tiers: DEVICE_BATCH_SIZE=16 uv run baseline/train.py
- PR karpathy#216 — SHA-256 verification for cached data shards. Each downloaded shard gets a .sha256 sidecar file. On reuse, integrity is verified and corrupted shards are re-downloaded. Uses os.replace() for atomic writes instead of os.rename(). (See the sketch after this commit message.)
- PR karpathy#237 — Harden tokenizer deserialization (pickle → JSON+base64). Replaces unsafe pickle.load with JSON serialization using base64-encoded mergeable_ranks. Legacy tokenizer.pkl is detected and rejected with a clear migration message. Eliminates arbitrary code execution risk from cache poisoning attacks on the tokenizer file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
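As a rough illustration of the shard-verification pattern cherry-picked from PR karpathy#216, here is a minimal sketch. The function names (fetch_shard, sha256_of) and the use of urllib for the download are assumptions for illustration, not the repo's actual API:

```python
import hashlib
import os
import urllib.request

def sha256_of(path: str) -> str:
    """Stream the file through SHA-256 so large shards need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def fetch_shard(url: str, dest: str) -> str:
    """Download a shard with a .sha256 sidecar; re-download if the cached copy is corrupted."""
    sidecar = dest + ".sha256"
    if os.path.exists(dest) and os.path.exists(sidecar):
        with open(sidecar) as f:
            expected = f.read().strip()
        if sha256_of(dest) == expected:
            return dest              # cached shard verified, reuse it
        os.remove(dest)              # corrupted cache: fall through and re-download
    tmp = dest + ".tmp"
    urllib.request.urlretrieve(url, tmp)
    digest = sha256_of(tmp)
    os.replace(tmp, dest)            # atomic replace: never leaves a half-written shard
    with open(sidecar, "w") as f:
        f.write(digest + "\n")
    return dest
```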