
fix(train): size rotary cache to config.sequence_len, not 10x #557

Open

lonexreb wants to merge 1 commit into karpathy:master from lonexreb:fix/rotary-cache-overallocation

Conversation

@lonexreb

Summary

GPT.__init__ allocated the rotary cos/sin buffers at config.sequence_len * 10 entries, but forward() asserts T <= cos.size(1), and T is bounded above by config.sequence_len (the dataloader produces rows of exactly MAX_SEQ_LEN). The upper 90% of the tables was therefore unreachable; it only cost GPU memory.
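
For concreteness, here is a minimal standalone sketch of the new sizing and the bound that forward() relies on; the helper is a hypothetical stand-in for _precompute_rotary_embeddings, not the repo's exact code:

import torch

# Hypothetical stand-in for _precompute_rotary_embeddings, for illustration only.
def precompute_rotary(seq_len: int, head_dim: int, base: float = 10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, head_dim // 2)
    # Layout matches the shapes in the memory table below: (1, seq_len, 1, head_dim // 2)
    return (freqs.cos()[None, :, None, :].bfloat16(),
            freqs.sin()[None, :, None, :].bfloat16())

cos, sin = precompute_rotary(seq_len=2048, head_dim=128)  # sized to sequence_len, no 10x headroom
T = 2048                                 # largest T the dataloader can produce (rows of MAX_SEQ_LEN)
assert T <= cos.size(1)                  # the forward() bound still holds at the new size
cos_T, sin_T = cos[:, :T], sin[:, :T]    # forward() only ever reads the first T entries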

Memory math

At the defaults (head_dim=128, sequence_len=2048, bf16):

       shape                   bytes      size
OLD    (1, 20480, 1, 64) × 2   5,242,880  5.00 MiB
NEW    (1, 2048, 1, 64) × 2      524,288  0.50 MiB
saved                          4,718,592  4.50 MiB

A small absolute number, but it's a 90% reduction in this buffer, which is read on every training step and stays resident in GPU memory for the entire run. At larger head_dim or sequence_len (which folks do try), the absolute saving scales linearly.
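
The table's arithmetic, restated as a quick standalone check:

bytes_per_elem = 2                               # bf16
elems = lambda L: 1 * L * 1 * 64                 # buffer shape (1, L, 1, head_dim // 2 = 64)
old = elems(2048 * 10) * bytes_per_elem * 2      # cos + sin -> 5,242,880 bytes (5.00 MiB)
new = elems(2048) * bytes_per_elem * 2           #           ->   524,288 bytes (0.50 MiB)
print((old - new) / 2**20)                       # 4.5 MiB saved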

All consumers verified

144:        # comment
148:        self.rotary_seq_len = config.sequence_len    # NEW
149:        cos, sin = self._precompute_rotary_embeddings(self.rotary_seq_len, head_dim)   # __init__
180:        cos, sin = self._precompute_rotary_embeddings(self.rotary_seq_len, head_dim)   # init_weights re-run after to_empty
181:        self.cos, self.sin = cos, sin
274:        assert T <= self.cos.size(1)                 # passes: T <= sequence_len
275:        cos_sin = self.cos[:, :T], self.sin[:, :T]   # only ever slices the first T entries

The forward pass only ever slices [:T], and the assertion still passes for every T the dataloader can produce. Both call sites (__init__ on the meta device and init_weights after to_empty) read self.rotary_seq_len, so a single change covers both.
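
A toy reconstruction (a stand-in, not the repo's GPT class) of why the single changed line covers both call sites:

import torch
import torch.nn as nn

class Toy(nn.Module):   # stand-in for GPT; only the rotary-cache plumbing is modeled
    def __init__(self, sequence_len=2048, head_dim=128):
        super().__init__()
        self.rotary_seq_len = sequence_len        # the one changed line
        self.head_dim = head_dim
        self.cos, self.sin = self._precompute()   # call site 1: __init__ (on meta in gpt.py)

    def _precompute(self):
        # Placeholder tensors with the real buffers' shape and dtype; values elided.
        shape = (1, self.rotary_seq_len, 1, self.head_dim // 2)
        return (torch.empty(shape, dtype=torch.bfloat16),
                torch.empty(shape, dtype=torch.bfloat16))

    def init_weights(self):
        self.cos, self.sin = self._precompute()   # call site 2: re-run after to_empty()

m = Toy()
m.init_weights()
assert m.cos.size(1) == m.rotary_seq_len          # both call sites pick up the new size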

Test plan

  • Verified the only self.cos / self.sin consumers are the assertion and the [:T] slice — both safe.
  • Memory math confirmed via standalone calculation (4.5 MiB saved at defaults); see the size-check sketch after this list.
  • Optional: run train.py end-to-end and observe peak_vram_mb drops by ~4.5 MiB at the default config — measurable but small relative to model + activations.
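
A possible size check for that verification, using stand-in tensors with the post-fix buffer shape:

import torch

def rotary_bytes(cos: torch.Tensor, sin: torch.Tensor) -> int:
    # Total bytes held by the two rotary buffers.
    return sum(t.nelement() * t.element_size() for t in (cos, sin))

cos = torch.empty(1, 2048, 1, 64, dtype=torch.bfloat16)   # post-fix shape
sin = torch.empty_like(cos)
print(rotary_bytes(cos, sin))                             # 524288 -> 0.50 MiB (was 5,242,880)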

Independence from other PRs

Doesn't touch the H100/MFU lines (#547), the PYTORCH_CUDA_ALLOC_CONF env var (#546), the warmup off-by-one (#556), or any data-loading code. No merge conflicts expected.

Commit message

GPT.__init__ allocated cos/sin buffers at config.sequence_len * 10
entries, but forward() asserts T <= cos.size(1) — and T is sourced
from inputs whose length is bounded above by config.sequence_len. So
the upper 90% of the cos/sin tables could never be indexed, wasting
~4.5 MiB of GPU memory at the default head_dim=128 / sequence_len=2048
configuration (5.00 MiB -> 0.50 MiB).

The same factor was used in init_weights when re-precomputing on the
real device, so both call sites pick up the fix via self.rotary_seq_len.

Behavior is unchanged: the assertion T <= cos.size(1) still passes for
every T the dataloader can produce.
@svlandeg (Collaborator) left a comment

We should review this one
