fix(train): size rotary cache to config.sequence_len, not 10x #557
Open
lonexreb wants to merge 1 commit into
Conversation
`GPT.__init__` allocated the `cos`/`sin` buffers at `config.sequence_len * 10` entries, but `forward()` asserts `T <= cos.size(1)`, and `T` is sourced from inputs whose length is bounded above by `config.sequence_len`. So the upper 90% of the `cos`/`sin` tables could never be indexed, wasting ~4.5 MiB of GPU memory at the default `head_dim=128` / `sequence_len=2048` configuration (5.00 MiB -> 0.50 MiB). The same factor was used in `init_weights` when re-precomputing on the real device, so both call sites pick up the fix via `self.rotary_seq_len`. Behavior is unchanged: the assertion `T <= cos.size(1)` still passes for every `T` the dataloader can produce.
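In diff form, the change presumably amounts to a single constant (a sketch; the surrounding lines are assumed, not copied from the repo):

```diff
 # GPT.__init__
-self.rotary_seq_len = config.sequence_len * 10
+self.rotary_seq_len = config.sequence_len
```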
svlandeg reviewed May 5, 2026
Summary
`GPT.__init__` allocated the rotary `cos`/`sin` buffers at `config.sequence_len * 10` entries, but `forward()` asserts `T <= cos.size(1)` and `T` is bounded above by `config.sequence_len` (the dataloader produces rows of exactly `MAX_SEQ_LEN`). The upper 90% of the tables was unreachable, just costing GPU memory.
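To make the reachability argument concrete, here is a minimal, self-contained sketch. The helper name `precompute_rotary` and the base-10000 frequency formula are assumptions, not the repo's actual code, but the buffer shapes match the PR text:

```python
import torch

def precompute_rotary(seq_len: int, head_dim: int):
    """cos/sin tables shaped (1, seq_len, 1, head_dim // 2)."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)            # (seq_len, head_dim // 2)
    return (freqs.cos()[None, :, None, :],      # (1, seq_len, 1, head_dim // 2)
            freqs.sin()[None, :, None, :])

SEQ_LEN, HEAD_DIM = 2048, 128

# Before: 10x oversized cache.
cos, sin = precompute_rotary(SEQ_LEN * 10, HEAD_DIM)
# forward() only ever checks T <= cos.size(1) and slices [:T], with T bounded
# by the dataloader at SEQ_LEN, so rows SEQ_LEN..10*SEQ_LEN are never read.
T = SEQ_LEN
assert T <= cos.size(1)

# After: exact-sized cache; the same assertion still holds for every valid T.
cos, sin = precompute_rotary(SEQ_LEN, HEAD_DIM)
assert T <= cos.size(1)
print(cos.shape)  # torch.Size([1, 2048, 1, 64])
```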
Memory math

At the defaults (`head_dim=128`, `sequence_len=2048`, bf16):

- Before: `(1, 20480, 1, 64)` × 2 buffers = 5.00 MiB
- After: `(1, 2048, 1, 64)` × 2 buffers = 0.50 MiB

A small absolute number, but it's a 90% reduction on this buffer, on the every-step hot path. At larger `head_dim` or `sequence_len` (which folks do try), the absolute saving scales linearly.
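The arithmetic, spelled out (bf16 is 2 bytes per element; the arguments mirror the config values above):

```python
def rotary_cache_mib(seq_len: int, head_dim: int, bytes_per_el: int = 2) -> float:
    # Two buffers (cos and sin), each shaped (1, seq_len, 1, head_dim // 2).
    elements = 1 * seq_len * 1 * (head_dim // 2)
    return 2 * elements * bytes_per_el / 2**20

print(rotary_cache_mib(2048 * 10, 128))  # 5.0 MiB (before)
print(rotary_cache_mib(2048, 128))       # 0.5 MiB (after)
```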
All consumers verified

The forward pass only ever slices `[:T]`. The assertion still passes for every `T` the dataloader can produce. Both call sites (`__init__` on meta + `init_weights` after `to_empty`) read `self.rotary_seq_len`, so a single change covers both.
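A sketch of the two-call-site pattern, reusing `precompute_rotary` from the sketch above (the method bodies are assumptions; only the attribute name and the meta/`to_empty` flow come from the PR text):

```python
from torch import nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.rotary_seq_len = config.sequence_len  # the fix: was * 10
        # Initial allocation (on the meta device in the real model):
        cos, sin = precompute_rotary(self.rotary_seq_len, config.head_dim)
        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)

    def init_weights(self):
        # Re-precompute on the real device after to_empty(); this reads the
        # same attribute, so one change resizes both call sites.
        cos, sin = precompute_rotary(self.rotary_seq_len, self.config.head_dim)
        self.cos.copy_(cos)
        self.sin.copy_(sin)
```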
Test plan

- `self.cos`/`self.sin` consumers are the assertion and the `[:T]` slice; both safe.
- Run `train.py` end-to-end and observe `peak_vram_mb` drops by ~4.5 MiB at the default config; measurable but small relative to model + activations (see the sketch after this list).
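Assuming `peak_vram_mb` is derived from CUDA's allocator stats, a quick standalone way to confirm the delta (the model construction here is hypothetical; train.py's actual setup will differ):

```python
import torch

torch.cuda.reset_peak_memory_stats()
model = GPT(config).to("cuda")  # hypothetical: build the model as train.py would
x = torch.zeros(1, config.sequence_len, dtype=torch.long, device="cuda")
_ = model(x)
print(f"peak_vram_mb: {torch.cuda.max_memory_allocated() / 2**20:.2f}")
```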
Independence from other PRs

Doesn't touch the H100/MFU lines (#547), the `PYTORCH_CUDA_ALLOC_CONF` env var (#546), the warmup off-by-one (#556), or any data-loading code. No merge conflicts expected.