[Feature request]: Eliminate pre-attention RMSNorm in Gemma 4 via scale invariance + weight folding

RMSNorm is scale-invariant: multiplying its input by a positive scalar does not change its output. So when an RMSNorm is followed by a (bias-free) linear projection and then another RMSNorm, the normalization step of the first RMSNorm can be eliminated entirely, a mathematically lossless simplification. In models that use QKV-normalization (such as Gemma 4), this means the pre-attention RMSNorm can be removed with no change to model outputs. See the [FlashNorm paper](https://arxiv.org/pdf/2407.09577).

<img width="589" height="205" alt="Image" src="https://github.com/user-attachments/assets/6fcb9e55-71c6-4dc1-8ed7-dcf03e4de902" />
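To make the scale-invariance argument concrete, here is a minimal sketch in plain Python (not taken from any Gemma implementation). It assumes a weightless RMSNorm with `eps = 0` and a bias-free projection; the names `rms_norm`, `matvec`, and `W_q` and the toy dimensions are illustrative. With a nonzero epsilon the two paths agree only approximately.

```python
import math
import random

def rms_norm(x, eps=0.0):
    """Weightless RMSNorm: divide x by its root mean square (eps=0 is an assumption)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def matvec(W, x):
    """Bias-free linear projection: multiply matrix W (rows x cols) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

random.seed(0)
d_model, d_head = 8, 4  # toy sizes, chosen only for illustration
x = [random.gauss(0, 3) for _ in range(d_model)]
W_q = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]

# Original path: pre-attention norm -> projection -> Q-norm
with_pre_norm = rms_norm(matvec(W_q, rms_norm(x)))

# Simplified path: projection -> Q-norm (pre-attention normalization dropped)
without_pre_norm = rms_norm(matvec(W_q, x))

# The two outputs should match up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(with_pre_norm, without_pre_norm))
print(max_diff)
```

The first RMSNorm only rescales `x` by the positive scalar `1 / rms(x)`; that scalar passes through the linear projection unchanged and is then cancelled by the second normalization.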

However, the learned weights of the pre-attention norm are a per-channel scale rather than a scalar, so they do not cancel and are still needed. They can be eliminated cleanly by folding them into the QKV projection weights using the [FlashNorm](https://arxiv.org/pdf/2407.09577) weight-folding trick, again with no loss in model accuracy.

<img width="332" height="183" alt="Image" src="https://github.com/user-attachments/assets/d97b50ba-1092-4d44-ad70-ff2bca448b1d" />
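The weight folding can be sketched the same way. This is an illustrative example under stated assumptions, not the FlashNorm reference code: `g` stands for the effective per-channel scale of the pre-attention norm, `h` for the scale of the Q-norm, `eps = 0`, and the projection has no bias. The helper `fold_norm_weight` is a hypothetical name. If an implementation stores the scale as an offset (for example `1 + weight`), the effective scale is what should be folded.

```python
import math
import random

def rms_norm(x, weight=None, eps=0.0):
    """RMSNorm with an optional per-channel scale (eps=0 is an assumption)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = [v / rms for v in x]
    if weight is not None:
        out = [o * w for o, w in zip(out, weight)]
    return out

def matvec(W, x):
    """Bias-free linear projection: multiply matrix W (rows x cols) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def fold_norm_weight(W, g):
    """Hypothetical helper: return W @ diag(g), i.e. scale column j of W by g[j]."""
    return [[w * gj for w, gj in zip(row, g)] for row in W]

random.seed(0)
d_model, d_head = 8, 4  # toy sizes, chosen only for illustration
x = [random.gauss(0, 3) for _ in range(d_model)]
g = [random.uniform(0.5, 1.5) for _ in range(d_model)]  # pre-attention norm scale
h = [random.uniform(0.5, 1.5) for _ in range(d_head)]   # Q-norm scale
W_q = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]

# Reference: weighted pre-attention norm -> projection -> Q-norm
reference = rms_norm(matvec(W_q, rms_norm(x, g)), h)

# Folded: no pre-attention norm at all; g is absorbed into the projection
W_folded = fold_norm_weight(W_q, g)
folded = rms_norm(matvec(W_folded, x), h)

# The two outputs should match up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(reference, folded))
print(max_diff)
```

Folding is a one-time, offline transformation of the checkpoint, so inference then runs with one fewer normalization and one fewer weight vector per attention block. The same fold would apply to the K and V projections that share the pre-attention norm.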

For reference, we have applied this weight-folding trick to a few LLMs (Llama, Qwen, SmolLM) here:
https://huggingface.co/models?other=weightless-rmsnorm

