Train: add gradient clipping before optimizer step #287
Open
renee-jia wants to merge 1 commit into karpathy:master
Conversation
relu² activations can produce gradient spikes that silently degrade model weights. The existing fast-fail (loss > 100) only catches damage after it has already happened. Clipping the gradients prevents a single spike from wasting a 5-minute experiment run.

Adds a GRAD_CLIP_NORM hyperparameter (default 1.0; set 0.0 to disable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
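For context, a minimal sketch of where the clip sits, assuming a standard PyTorch training loop; the toy model, optimizer, and loop below are illustrative, not the actual baseline/train.py diff:

```python
import torch

GRAD_CLIP_NORM = 1.0  # proposed hyperparameter; 0.0 disables clipping

# Toy stand-ins for the real model and optimizer.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(10):
    x, y = torch.randn(8, 16), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    if GRAD_CLIP_NORM > 0.0:
        # Rescale all gradients in place so their global L2 norm is at
        # most GRAD_CLIP_NORM -- between backward() and step(), so a
        # spike never reaches the weights.
        torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP_NORM)
    optimizer.step()
```

Because `clip_grad_norm_` rescales the whole gradient vector rather than clamping elements, the update direction is preserved and only its magnitude is bounded.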
IgorTavcar added a commit to IgorTavcar/autoresearch that referenced this pull request on Mar 17, 2026:
…iler, UCB1 search

PR karpathy#287 — Gradient clipping before optimizer step (baseline/train.py)
Adds a GRAD_CLIP_NORM hyperparameter (default 1.0, set 0.0 to disable). relu² activations can produce gradient spikes that silently degrade weights. The existing loss > 100 fast-fail only catches damage after it has already happened. Clipping prevents wasted experiment runs.

PR karpathy#279 — --profile flag for LLM-readable CUDA kernel summary (baseline/train.py)
Adds an argparse --profile flag that runs torch.profiler over a few warmup steps, prints a Markdown table of the top CUDA kernels by self-time, then exits. Lets the agent identify hardware bottlenecks (attention vs MLP vs elementwise) without needing trace visualization tools. Usage: uv run baseline/train.py --profile

Issue karpathy#284 — DUSE alt program (baseline/program-alt.md)
Alternative program.md integrating Dimensional UCB1 Search + Experiment Memory from issue karpathy#284. Adds: a 7-dimension map, experiments.json structured memory, a UCB1 dimension selector (exploration vs exploitation), a 90-second early-abort gate, and a rescue pool for recombining discarded sub-mechanisms. Pure prompt change; no code modifications required.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
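A rough sketch of what the described --profile mode could look like, assuming torch.profiler and a CUDA device. This uses the profiler's built-in text table rather than the Markdown formatting the commit mentions, and a toy workload stands in for the real training step:

```python
import argparse
import torch
from torch.profiler import profile, ProfilerActivity

parser = argparse.ArgumentParser()
parser.add_argument("--profile", action="store_true",
                    help="print top CUDA kernels by self-time, then exit")
args = parser.parse_args()

if args.profile:
    # Toy workload in place of the real model's training step.
    model = torch.nn.Linear(1024, 1024).cuda()
    x = torch.randn(64, 1024, device="cuda")
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(5):  # a few warmup steps
            model(x).sum().backward()
    # Aggregate per-kernel stats and print the heaviest CUDA kernels.
    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))
    raise SystemExit(0)
```

And a self-contained illustration of a UCB1 selector over experiment dimensions; the dimension names and the (pulls, total_reward) bookkeeping are hypothetical, not taken from program-alt.md:

```python
import math

def ucb1_pick(stats, c=math.sqrt(2)):
    """stats: dict mapping dimension -> (pulls, total_reward)."""
    total = sum(pulls for pulls, _ in stats.values())
    # Explore: try every dimension once before exploiting any of them.
    for dim, (pulls, _) in stats.items():
        if pulls == 0:
            return dim
    # Exploit: pick the best mean reward plus an exploration bonus that
    # shrinks as a dimension accumulates pulls.
    return max(stats, key=lambda d: stats[d][1] / stats[d][0]
               + c * math.sqrt(math.log(total) / stats[d][0]))

stats = {"optimizer": (3, 1.2), "architecture": (1, 0.9), "data": (0, 0.0)}
print(ucb1_pick(stats))  # -> "data" (unexplored dimensions go first)
```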