[Feature request]: Eliminate pre-attention RMSNorm in Gemma 4 via scale invariance + weight folding #638

@NilsGraf

Description

Because RMSNorm is scale-invariant, a sequence of RMSNorm, linear projection, and a second RMSNorm allows the first RMSNorm to be eliminated entirely: the first norm only rescales each token by a positive scalar, the projection is linear, and the second norm cancels that scalar. This is a mathematically lossless simplification. In models that use QKV normalization (such as Gemma 4), this means the pre-attention RMSNorm can be removed with no change to model outputs; see the FlashNorm paper.
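The invariance can be checked numerically. The following is a minimal NumPy sketch, not code from Gemma or FlashNorm: `rms_norm`, `W_q`, `g_q`, and all shapes are hypothetical names chosen for illustration. The pre-attention norm is taken to be weightless here, and `eps` is set to 0; with a nonzero epsilon the two paths agree only up to an epsilon-sized error.

```python
import numpy as np

def rms_norm(x, gain=None, eps=0.0):
    """RMSNorm over the last axis; `gain` is the optional learned per-channel scale."""
    y = x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return y if gain is None else y * gain

rng = np.random.default_rng(0)
d_model, d_head = 16, 8                    # illustrative sizes only
x = rng.normal(size=(4, d_model))          # 4 token activations
W_q = rng.normal(size=(d_model, d_head))   # hypothetical query projection
g_q = rng.normal(size=d_head)              # gain of the post-projection (Q) norm

# Baseline: weightless pre-attention norm -> projection -> Q norm.
baseline = rms_norm(rms_norm(x) @ W_q, g_q)

# Simplified: the pre-attention normalization is skipped entirely.
simplified = rms_norm(x @ W_q, g_q)

# The per-token scalar introduced by the first norm is cancelled by the second.
max_diff = np.max(np.abs(baseline - simplified))
print(max_diff)
```

The same argument applies when the second norm is applied per attention head, since the scalar from the first norm multiplies every head's slice of the projection output equally.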

Image

However, the pre-attention norm's learned weights are a per-channel vector rather than a single scalar, so they do not cancel and are still needed. They can be eliminated cleanly by folding them into the QKV projection weights using the FlashNorm weight-folding trick, again with no loss in model accuracy.
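As a rough illustration of the folding step, the sketch below scales each input row of a projection matrix by the corresponding norm weight, so that `(x * g_pre) @ W_q` becomes `x @ W_folded`. All names (`g_pre`, `W_q`, `W_folded`, `g_q`) and shapes are assumptions made for this example, and `g_pre` stands for the effective per-channel multiplier; if an implementation stores the gain in an offset form, the effective value would need to be computed first.

```python
import numpy as np

def rms_norm(x, gain=None, eps=0.0):
    """RMSNorm over the last axis; `gain` is the optional learned per-channel scale."""
    y = x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return y if gain is None else y * gain

rng = np.random.default_rng(1)
d_model, d_head = 16, 8                    # illustrative sizes only
x = rng.normal(size=(4, d_model))          # 4 token activations
g_pre = rng.normal(size=d_model)           # learned gain of the pre-attention norm
W_q = rng.normal(size=(d_model, d_head))   # hypothetical query projection (in, out)
g_q = rng.normal(size=d_head)              # gain of the post-projection (Q) norm

# Reference: full pre-attention RMSNorm -> projection -> Q norm.
reference = rms_norm(rms_norm(x, g_pre) @ W_q, g_q)

# Fold the gain into the projection: W_folded = diag(g_pre) @ W_q.
W_folded = g_pre[:, None] * W_q

# Folded path: no pre-attention norm and no separate gain vector.
folded = rms_norm(x @ W_folded, g_q)

print(np.max(np.abs(reference - folded)))
```

The sketch uses an `(in, out)` weight layout. For a layout of `(out, in)`, as in PyTorch's `nn.Linear`, the equivalent fold would scale columns instead (`W * g_pre[None, :]`). The same fold would be applied to the K and V projections.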

Image

For reference, we have applied this weight-folding trick to a few LLMs (Llama, Qwen, SmolLM) here:
https://huggingface.co/models?other=weightless-rmsnorm
