
fix(train): size rotary cache to config.sequence_len, not 10x #557

Open

lonexreb wants to merge 1 commit into karpathy:master from lonexreb:fix/rotary-cache-overallocation

Conversation

@lonexreb

Summary

GPT.__init__ allocated the rotary cos/sin buffers at config.sequence_len * 10 entries, but forward() asserts T <= cos.size(1), and T is bounded above by config.sequence_len (the dataloader produces rows of exactly MAX_SEQ_LEN). The upper 90% of the tables was therefore unreachable; it only cost GPU memory.
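
For concreteness, here is a minimal standalone sketch of the new sizing and the bound that forward() relies on; the helper is a hypothetical stand-in for _precompute_rotary_embeddings, not the repo's exact code:

import torch

# Hypothetical stand-in for _precompute_rotary_embeddings, for illustration only.
def precompute_rotary(seq_len: int, head_dim: int, base: float = 10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, head_dim // 2)
    # Layout matches the shapes in the memory table below: (1, seq_len, 1, head_dim // 2)
    return (freqs.cos()[None, :, None, :].bfloat16(),
            freqs.sin()[None, :, None, :].bfloat16())

cos, sin = precompute_rotary(seq_len=2048, head_dim=128)  # sized to sequence_len, no 10x headroom
T = 2048                                 # largest T the dataloader can produce (rows of MAX_SEQ_LEN)
assert T <= cos.size(1)                  # the forward() bound still holds at the new size
cos_T, sin_T = cos[:, :T], sin[:, :T]    # forward() only ever reads the first T entries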

Memory math

At the defaults (head_dim=128, sequence_len=2048, bf16):

       shape                   bytes      size
OLD    (1, 20480, 1, 64) × 2   5,242,880  5.00 MiB
NEW    (1, 2048, 1, 64) × 2      524,288  0.50 MiB
saved                          4,718,592  4.50 MiB

A small absolute number, but it's a 90% reduction in this buffer, which is read on every training step and stays resident in GPU memory for the entire run. At larger head_dim or sequence_len (which folks do try), the absolute saving scales linearly.
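
The table's arithmetic, restated as a quick standalone check:

bytes_per_elem = 2                               # bf16
elems = lambda L: 1 * L * 1 * 64                 # buffer shape (1, L, 1, head_dim // 2 = 64)
old = elems(2048 * 10) * bytes_per_elem * 2      # cos + sin -> 5,242,880 bytes (5.00 MiB)
new = elems(2048) * bytes_per_elem * 2           #           ->   524,288 bytes (0.50 MiB)
print((old - new) / 2**20)                       # 4.5 MiB saved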

All consumers verified

144:        # comment
148:        self.rotary_seq_len = config.sequence_len    # NEW
149:        cos, sin = self._precompute_rotary_embeddings(self.rotary_seq_len, head_dim)   # __init__
180:        cos, sin = self._precompute_rotary_embeddings(self.rotary_seq_len, head_dim)   # init_weights re-run after to_empty
181:        self.cos, self.sin = cos, sin
274:        assert T <= self.cos.size(1)                 # passes: T <= sequence_len
275:        cos_sin = self.cos[:, :T], self.sin[:, :T]   # only ever slices the first T entries

The forward pass only ever slices [:T], and the assertion still passes for every T the dataloader can produce. Both call sites (__init__ on the meta device and init_weights after to_empty) read self.rotary_seq_len, so a single change covers both.
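
A toy reconstruction (a stand-in, not the repo's GPT class) of why the single changed line covers both call sites:

import torch
import torch.nn as nn

class Toy(nn.Module):   # stand-in for GPT; only the rotary-cache plumbing is modeled
    def __init__(self, sequence_len=2048, head_dim=128):
        super().__init__()
        self.rotary_seq_len = sequence_len        # the one changed line
        self.head_dim = head_dim
        self.cos, self.sin = self._precompute()   # call site 1: __init__ (on meta in gpt.py)

    def _precompute(self):
        # Placeholder tensors with the real buffers' shape and dtype; values elided.
        shape = (1, self.rotary_seq_len, 1, self.head_dim // 2)
        return (torch.empty(shape, dtype=torch.bfloat16),
                torch.empty(shape, dtype=torch.bfloat16))

    def init_weights(self):
        self.cos, self.sin = self._precompute()   # call site 2: re-run after to_empty()

m = Toy()
m.init_weights()
assert m.cos.size(1) == m.rotary_seq_len          # both call sites pick up the new size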

Test plan

  • Verified the only self.cos / self.sin consumers are the assertion and the [:T] slice — both safe.
  • Memory math confirmed via standalone calculation (4.5 MiB saved at defaults); see the size-check sketch after this list.
  • Optional: run train.py end-to-end and observe peak_vram_mb drops by ~4.5 MiB at the default config — measurable but small relative to model + activations.
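
A possible size check for that verification, using stand-in tensors with the post-fix buffer shape:

import torch

def rotary_bytes(cos: torch.Tensor, sin: torch.Tensor) -> int:
    # Total bytes held by the two rotary buffers.
    return sum(t.nelement() * t.element_size() for t in (cos, sin))

cos = torch.empty(1, 2048, 1, 64, dtype=torch.bfloat16)   # post-fix shape
sin = torch.empty_like(cos)
print(rotary_bytes(cos, sin))                             # 524288 -> 0.50 MiB (was 5,242,880)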

Independence from other PRs

Doesn't touch the H100/MFU lines (#547), the PYTORCH_CUDA_ALLOC_CONF env var (#546), the warmup off-by-one (#556), or any data-loading code. No merge conflicts expected.

Commit message

GPT.__init__ allocated cos/sin buffers at config.sequence_len * 10
entries, but forward() asserts T <= cos.size(1) — and T is sourced
from inputs whose length is bounded above by config.sequence_len. So
the upper 90% of the cos/sin tables could never be indexed, wasting
~4.5 MiB of GPU memory at the default head_dim=128 / sequence_len=2048
configuration (5.00 MiB -> 0.50 MiB).

The same factor was used in init_weights when re-precomputing on the
real device, so both call sites pick up the fix via self.rotary_seq_len.

Behavior is unchanged: the assertion T <= cos.size(1) still passes for
every T the dataloader can produce.
@svlandeg (Collaborator) left a comment

We should review this one
