No “emergent behavior / aha moment” when retraining GPT-2 on FineWeb; warmup / “warm training” guidance requested #889
Replies: 7 comments 4 replies
(Screenshot: TensorBoard metrics showing the training loss, validation loss, learning rate, and tokens seen.)
(Screenshot: sample responses from the model after a certain number of steps, completing a sentence from the given start context.)
Thanks for sharing this very interesting discussion! Regarding your points:

A. This looks like a reasonable mod. I have a suggestion below regarding QK-Norm that might additionally help.

B. I would stick with it for now, but maybe in the next run you could print the large-gradient/high-loss samples to investigate further.

C. I am not sure if you'd see it with this small model, but it should match the published GPT-2, I'd say.

D. Usually this is done with re-warming (a minimal sketch follows right after these points). I briefly wrote about it here [https://magazine.sebastianraschka.com/p/tips-for-llm-pretraining-and-evaluating-rms] based on the "Simple and Scalable Strategies to Continually Pre-train Large Language Models" paper.

E. Yes, it would be reasonable to expect the same loss. What I would do is take a sample from a news article that wasn't in the training data, calculate the loss or perplexity for the base GPT-2 127M, and then do this periodically for your trained model. Use something that we know could not have been in the training data, for example a paragraph from a WSJ article published today.
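To make the re-warming in D a bit more concrete, here is a minimal sketch of resuming from a checkpoint with a short linear re-warming phase followed by cosine decay. The peak learning rate, warmup length, and total step count are illustrative assumptions, not recommended values:

```python
import math
import torch

# Illustrative settings (assumptions; tune for your run):
peak_lr = 1e-4          # often lower than the original pretraining peak LR
min_lr = 1e-5
warmup_steps = 1000     # short linear re-warming phase
total_steps = 50_000    # length of the continued-pretraining run

def rewarm_cosine_lr(step):
    # Linear re-warming from ~0 up to peak_lr ...
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # ... then cosine decay from peak_lr down to min_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Usage inside the training loop, after loading the checkpoint:
# optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)
# for step, (inputs, targets) in enumerate(train_loader):
#     lr = rewarm_cosine_lr(step)
#     for param_group in optimizer.param_groups:
#         param_group["lr"] = lr
#     ...
```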
Evaluating on such held-out text would maybe help to more fairly compare the two models to each other. PS: As reference points, I am getting numbers for GPT-2 127M, gpt2-medium (355M), and Qwen 0.6B Base.
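As a rough illustration of this held-out perplexity check, here is a minimal sketch using the Hugging Face transformers library. The model IDs and the sample text are placeholders; you would swap in your own checkpoint and a fresh news paragraph:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def heldout_perplexity(model_id, text):
    # Tokenize the held-out text and compute the average next-token loss
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()

    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss
        loss = model(input_ids, labels=input_ids).loss
    return loss.item(), torch.exp(loss).item()

# Placeholder text; use a paragraph that is guaranteed to postdate the training data
sample = "Paste a paragraph from today's news article here."
for model_id in ["gpt2", "gpt2-medium"]:  # add your own model here as well
    loss, ppl = heldout_perplexity(model_id, sample)
    print(f"{model_id}: loss={loss:.2f}, ppl={ppl:.1f}")
```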
I have a few questions too if you don't mind:
Regarding tips: I agree that these spikes could come from the data. But that being said, there are a few improvements I would try. The important thing, if you have the budget and time, is to try one thing at a time so you can see where the differences come from. Some suggestions are:

F. I would probably remove dropout (or set it to 0); in my experience it doesn't help and may make things even worse.

G. Another one would be to add QK-Norm like in Qwen3 here: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/11_qwen3/standalone-qwen3.ipynb

```python
class MultiHeadAttention(nn.Module):
    def __init__(...):
        ...
        if qk_norm:
            self.q_norm = LayerNorm(head_dim)
            self.k_norm = LayerNorm(head_dim)

    def forward(...):
        # ...
        queries = self.q_norm(queries)
        keys = self.k_norm(keys)
        attn_scores = queries @ keys.transpose(2, 3)
        # ...
```

Usually, QK-Norm is nowadays implemented with RMSNorm, but for consistency, I would try LayerNorm first. Maybe it gets rid of the spikes. (A more complete, runnable sketch of this is included below.)

H. I would also be curious how Qwen3 0.6B performs in terms of smoothness, if it is not too large to run. You could technically shrink it by reducing the number of layers. The code in that notebook should work as a drop-in replacement for the GPT model.

If all of that doesn't work, it might be related to the dataset or to the optimizer and learning rate schedule. But out of curiosity, I would try the things above first. I'd be curious what the results are.
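For reference, here is a more complete, self-contained sketch of the QK-Norm variant from point G. It is modeled loosely on a book-style multi-head attention module; the argument names and the exact placement of the LayerNorm calls are assumptions rather than the notebook's verbatim code:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads,
                 qkv_bias=False, qk_norm=True):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)

        # QK-Norm: normalize queries and keys per head before the dot product
        self.q_norm = nn.LayerNorm(self.head_dim) if qk_norm else nn.Identity()
        self.k_norm = nn.LayerNorm(self.head_dim) if qk_norm else nn.Identity()

        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape

        # Split projections into (batch, heads, tokens, head_dim)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        # Apply QK-Norm before computing attention scores
        queries = self.q_norm(queries)
        keys = self.k_norm(keys)

        attn_scores = queries @ keys.transpose(2, 3)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], float("-inf"))
        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)
```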
As for the training speed of your code on an H100: I used your code from Chapter 4, and it reaches nearly 400,000 tokens/s while fully utilizing the H100's compute and memory. A step takes less than 0.5 s, so training on 3.2B tokens took about 3.2B / 400,000 = 8,000 s ≈ 2.2 hours. There is definitely room for improvement, but currently I am working on ensuring the correctness of the model.
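Spelling out that throughput arithmetic as a quick check (just the numbers from this comment):

```python
tokens_total = 3.2e9        # tokens in the run
tokens_per_sec = 400_000    # observed throughput on the H100

seconds = tokens_total / tokens_per_sec
print(f"{seconds:,.0f} s ~ {seconds / 3600:.1f} h")  # 8,000 s ~ 2.2 h
```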
Here is the full code I used: https://github.com/talentJay-ux/LLMs-from-scratch/blob/b66c1c9c74a2f06bc612054d030bff0093b693d8/ch05/10_llm-training-speed/03_train_from_scratch.py. Once I am able to fix the model-collapse issue, I will definitely try your other recommendations, such as using different norms and position embeddings!
If I may add my 2 cents: besides the QK-Norm that Sebastian mentioned, you can maybe find some ideas to stabilize your training in this paper: https://arxiv.org/abs/2410.16682. I'm also interested to know what Sebastian thinks about applying L2 weight decay to LayerNorm weights; I had the impression it wasn't a good idea. Qwen, for their Qwen3-Next, used RMSNorm with "zero-centered weights" to better adapt it for L2. By the way, since Sebastian mentioned optimizers, a nice variant of Muon that Moonshot used to train their awesome Kimi K2, and that could help here, is MuonClip (Muon plus QK-weight rescaling based on the max attention logits seen; it can be implemented separately, even with Adam). A rough sketch of the QK rescaling part is below. In any case, good luck with your training, it's a good hands-on project 👍
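To illustrate the QK rescaling part only, here is a rough sketch. It is my paraphrase of the idea, not Moonshot's implementation, and the module and attribute names follow the attention sketch earlier in this thread rather than any real API:

```python
import torch

@torch.no_grad()
def qk_clip_(attn_module, max_logit, tau=100.0):
    """Rescale the query/key projection weights of one attention module
    if the largest pre-softmax attention logit seen in the last step
    exceeded the threshold tau.

    attn_module is assumed to expose W_query / W_key linear layers, as in
    the MultiHeadAttention sketch above; max_logit is the largest attention
    score recorded during the forward pass.
    """
    if max_logit > tau:
        # Splitting the correction between W_q and W_k scales q @ k by tau / max_logit
        scale = (tau / max_logit) ** 0.5
        attn_module.W_query.weight.mul_(scale)
        attn_module.W_key.weight.mul_(scale)

# Usage sketch: record the max attention logit per block during forward(),
# then call qk_clip_ after optimizer.step():
# for block, logit in zip(model.trf_blocks, max_logits_seen):
#     qk_clip_(block.att, logit)
```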
Summary
I re-trained a GPT-2-style model from random parameters using the HuggingFaceFW/FineWeb dataset. Training and validation loss plateau around ~4 and don't exhibit a sudden drop, and evaluation generations repeat heavily. I'm looking for guidance on "warm training" parameters and optimization suggestions for training the model from scratch.
Environment & Model

Data
- HuggingFaceFW/fineweb (streaming)
- language_score ≥ 0.9
- <|endoftext|> inserted between documents

Dataloader (key settings)
- num_workers=4, pin_memory=True
- val_mod=100, fixed eval loaders for stable loss

Run stats (this run)
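For reference, a hedged sketch of what the streaming setup described under Data and Dataloader above might look like. The batch size, context length, and helper names are illustrative assumptions rather than the exact script, and per-worker sharding is omitted for brevity:

```python
import tiktoken
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, IterableDataset

tokenizer = tiktoken.get_encoding("gpt2")
EOT = "<|endoftext|>"

class FineWebStream(IterableDataset):
    """Streams FineWeb, filters by language_score, and yields fixed-length
    (input, target) blocks with <|endoftext|> between documents."""

    def __init__(self, context_length=1024, min_language_score=0.9):
        self.context_length = context_length
        self.ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
        self.ds = self.ds.filter(lambda ex: ex["language_score"] >= min_language_score)

    def __iter__(self):
        buffer = []
        for example in self.ds:
            # Append the document plus an <|endoftext|> separator
            buffer += tokenizer.encode(example["text"] + EOT, allowed_special={EOT})
            # Emit (input, target) pairs shifted by one token
            while len(buffer) > self.context_length:
                chunk = buffer[: self.context_length + 1]
                yield torch.tensor(chunk[:-1]), torch.tensor(chunk[1:])
                buffer = buffer[self.context_length :]

# Note: with num_workers > 0, each worker iterates its own copy of the stream
# unless sharding is added; this is omitted here to keep the sketch short.
train_loader = DataLoader(
    FineWebStream(), batch_size=16, num_workers=4, pin_memory=True
)
```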
What I observe
Reproduction (minimal)
(Happy to post a full runnable script if helpful.)
Questions / requests for guidance
A. Gradient normalization
I still couldn't avoid the sudden loss spikes. Do you recommend other methods? For example, I could try to skip the update entirely if the training loss is too large (see the sketch below).
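A minimal sketch of what I mean by skipping spiky updates. The loss threshold and the gradient-clipping value are illustrative, and model, optimizer, and train_loader are assumed to be set up already:

```python
import torch

SKIP_LOSS_THRESHOLD = 7.0   # illustrative; e.g., a few nats above the recent average
MAX_GRAD_NORM = 1.0

for step, (inputs, targets) in enumerate(train_loader):
    logits = model(inputs)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), targets.flatten()
    )

    # Skip the whole update if this batch looks like an outlier
    if loss.item() > SKIP_LOSS_THRESHOLD:
        optimizer.zero_grad(set_to_none=True)
        continue

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```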
B. Data
C. “Emergence” expectations
D. Warm training suggestions
I would pick up one of the pretrained models and train on top of it. Do you have learning rate suggestions when doing warm training?
E. Cross-entropy loss
What would be a good expectation for the loss? Random baseline: ≈ ln(50304) ≈ 10.83 nats. GPT-2 ~124–128M (well-trained on WebText-like): ≈ 3.4–3.8 (PPL ≈ 30–45). Given the above, is ~3.5–3.8 a reasonable validation loss target for GPT-2-124M on FineWeb?
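For quick reference, the numbers above are related by loss = ln(perplexity):

```python
import math

vocab_size = 50304
print(math.log(vocab_size))           # random-guess baseline, about 10.83 nats
print(math.exp(3.4), math.exp(3.8))   # losses of 3.4 and 3.8 correspond to PPL of about 30 and 45
```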
Thank you!