fix(train): guard against zero grad_accum_steps #264

Open
jluk wants to merge 1 commit into karpathy:master from jluk:fix/guard-grad-accum-steps

Conversation


@jluk jluk commented Mar 14, 2026

If TOTAL_BATCH_SIZE is smaller than a single forward pass (DEVICE_BATCH_SIZE * MAX_SEQ_LEN), grad_accum_steps silently becomes 0, causing the training loop to skip all gradient accumulation.

Add an assertion to fail fast with a clear message instead.
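A minimal sketch of the guard described above. The concrete values and the exact assertion message are illustrative, not the repository's code:

```python
# Hypothetical sketch of the fix; names mirror the PR description.
TOTAL_BATCH_SIZE = 524288   # total tokens per optimizer step (example value)
DEVICE_BATCH_SIZE = 32      # sequences per forward pass (example value)
MAX_SEQ_LEN = 1024          # tokens per sequence (example value)

tokens_per_fwd = DEVICE_BATCH_SIZE * MAX_SEQ_LEN
grad_accum_steps = TOTAL_BATCH_SIZE // tokens_per_fwd

# Integer division silently yields 0 when TOTAL_BATCH_SIZE < tokens_per_fwd,
# turning the accumulation loop into range(0); fail fast instead.
assert grad_accum_steps > 0, (
    f"TOTAL_BATCH_SIZE ({TOTAL_BATCH_SIZE}) must be >= "
    f"DEVICE_BATCH_SIZE * MAX_SEQ_LEN ({tokens_per_fwd})"
)
```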


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
IgorTavcar added a commit to IgorTavcar/autoresearch that referenced this pull request Mar 14, 2026
…rity hardening

Cherry-picked improvements from open PRs on karpathy/autoresearch:

PR karpathy#265 — Save checkpoint before eval to survive crashes
  Applied to baseline/train.py (torch.save) and train_mlx.py (mx.save_safetensors).
  If eval OOMs or crashes, training work is preserved.
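  The ordering change can be sketched as follows. To stay self-contained this uses plain JSON rather than torch.save or mx.save_safetensors, and save_checkpoint / maybe_eval are hypothetical names:

```python
import json
import os
import tempfile

def save_checkpoint(state, path):
    """Write the checkpoint atomically so a crash mid-write can't corrupt it."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic swap into place

def maybe_eval(step, state, evaluate, path="checkpoint.json"):
    # Checkpoint *before* eval: if evaluation OOMs or crashes,
    # the completed training steps are already on disk.
    save_checkpoint(state, path)
    return evaluate(state)
```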

PR karpathy#264 — Guard against zero grad_accum_steps
  Added assertion after computing grad_accum_steps. Catches silent
  misconfiguration when TOTAL_BATCH_SIZE < DEVICE_BATCH_SIZE * MAX_SEQ_LEN.

PR karpathy#188 — Add helpful error messages to bare asserts
  Window pattern and batch size divisibility asserts now explain what went
  wrong instead of a bare AssertionError.
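  The pattern can be sketched as follows; check_config and the exact message wording are illustrative, not the repository's code:

```python
def check_config(total_batch_size, device_batch_size, max_seq_len, window):
    # Each assert names the offending values, so a failure explains
    # itself instead of raising a bare AssertionError.
    tokens_per_fwd = device_batch_size * max_seq_len
    assert total_batch_size % tokens_per_fwd == 0, (
        f"TOTAL_BATCH_SIZE ({total_batch_size}) must be divisible by "
        f"DEVICE_BATCH_SIZE * MAX_SEQ_LEN ({tokens_per_fwd})"
    )
    assert max_seq_len % window == 0, (
        f"MAX_SEQ_LEN ({max_seq_len}) must be a multiple of "
        f"the window size ({window})"
    )
```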

PR karpathy#185 — Print startup_seconds in final summary
  Useful diagnostic for measuring compilation/init overhead across backends.

PR karpathy#138 — Make DEVICE_BATCH_SIZE configurable via env var
  Avoids source code edits when switching between Apple Silicon tiers:
  DEVICE_BATCH_SIZE=16 uv run baseline/train.py
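  A one-line sketch of the env-var override, assuming a source-code default of 32 (the actual default may differ):

```python
import os

# Read DEVICE_BATCH_SIZE from the environment, falling back to the
# in-source default, so switching hardware tiers needs no code edit.
DEVICE_BATCH_SIZE = int(os.environ.get("DEVICE_BATCH_SIZE", "32"))
```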

PR karpathy#216 — SHA-256 verification for cached data shards
  Each downloaded shard gets a .sha256 sidecar file. On reuse, integrity
  is verified and corrupted shards are re-downloaded. Uses os.replace()
  for atomic writes instead of os.rename().
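  A self-contained sketch of the sidecar scheme; function names are hypothetical:

```python
import hashlib
import os

def sha256_of(path, chunk=1 << 20):
    """Hash a file incrementally so large shards don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_sidecar(shard_path):
    """Record the shard's digest in a .sha256 sidecar file."""
    tmp = shard_path + ".sha256.tmp"
    with open(tmp, "w") as f:
        f.write(sha256_of(shard_path))
    # os.replace overwrites atomically; os.rename fails on Windows
    # if the destination already exists.
    os.replace(tmp, shard_path + ".sha256")

def shard_is_valid(shard_path):
    """True only if the cached shard matches its sidecar digest."""
    sidecar = shard_path + ".sha256"
    if not (os.path.exists(shard_path) and os.path.exists(sidecar)):
        return False
    with open(sidecar) as f:
        expected = f.read().strip()
    return sha256_of(shard_path) == expected
```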

PR karpathy#237 — Harden tokenizer deserialization (pickle → JSON+base64)
  Replaces unsafe pickle.load with JSON serialization using base64-encoded
  mergeable_ranks. Legacy tokenizer.pkl is detected and rejected with a
  clear migration message. Eliminates arbitrary code execution risk from
  cache poisoning attacks on the tokenizer file.
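  The pickle-to-JSON migration can be sketched as follows. save_ranks / load_ranks are hypothetical names, assuming mergeable_ranks maps bytes to integer ranks (as in tiktoken-style tokenizers):

```python
import base64
import json

def save_ranks(mergeable_ranks, path):
    """Serialize bytes->int ranks as JSON with base64-encoded keys."""
    # bytes keys can't appear in JSON directly, so base64-encode them.
    enc = {base64.b64encode(k).decode("ascii"): v
           for k, v in mergeable_ranks.items()}
    with open(path, "w") as f:
        json.dump(enc, f)

def load_ranks(path):
    """Parse the JSON back into bytes->int ranks; no code execution risk,
    unlike pickle.load on an attacker-controlled cache file."""
    with open(path) as f:
        enc = json.load(f)
    return {base64.b64decode(k): v for k, v in enc.items()}
```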

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>