[PyTorch] Enable generic QK norm support (+ RMSNorm/LayerNorm) #1966
Conversation
Great, thanks for looking into this so quickly. This all looks good, but:
I have to admit there is one more thing that I just noticed. The Qwen3 models apply QK normalization before RoPE (see here), in contrast to this implementation, which is based on Llama 4.
I was not aware that there are two different formulations for this. Sorry for that.
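For context, here is a minimal PyTorch sketch of the two orderings being discussed. This is illustrative only; `rms_norm` and `apply_rope` are toy helpers, not the Transformer Engine implementation:

```python
import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm over the head dimension (learnable weight omitted for brevity).
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotate-half formulation of rotary position embeddings.
    x1, x2 = x.chunk(2, dim=-1)
    return x * cos + torch.cat((-x2, x1), dim=-1) * sin

# Toy tensors: [batch, seq, heads, head_dim].
q = torch.randn(1, 8, 4, 64)
k = torch.randn(1, 8, 4, 64)
pos = torch.arange(8, dtype=torch.float32)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, 32, dtype=torch.float32) / 32))
angles = torch.outer(pos, inv_freq)  # [seq, head_dim // 2]
cos = torch.cat((angles, angles), dim=-1).cos()[None, :, None, :]
sin = torch.cat((angles, angles), dim=-1).sin()[None, :, None, :]

# Qwen3-style: QK norm applied *before* RoPE.
q_qwen = apply_rope(rms_norm(q), cos, sin)
k_qwen = apply_rope(rms_norm(k), cos, sin)

# Llama-4-style: QK norm applied *after* RoPE.
q_llama = rms_norm(apply_rope(q, cos, sin))
k_llama = rms_norm(apply_rope(k, cos, sin))
```

The two results differ because RoPE rotates pairs of channels while RMSNorm rescales whole head vectors, so the operations do not commute in general.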
This flexibility might be worth it; I've added support for it.
/te-ci pytorch
Great, that was super quick! This looks very good to me 🥳
/te-ci pytorch
LGTM
Description
For training stabilization purposes, QK tensors might be normalized. This PR adds:
- RMSNorm/LayerNorm as normalization types (in addition to L2Normalization).
- Applying QK normalization either before or after RoPE (following both the Qwen and Llama approaches).

Extension of #1864
Fixes #1958
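As a rough sketch of what the three normalization options compute over the head dimension (the `qk_norm` helper and its signature below are hypothetical illustrations, not the actual Transformer Engine API):

```python
from typing import Optional

import torch
import torch.nn.functional as F

def qk_norm(x: torch.Tensor, norm_type: str,
            weight: Optional[torch.Tensor] = None,
            bias: Optional[torch.Tensor] = None,
            eps: float = 1e-6) -> torch.Tensor:
    # Dispatch on the requested normalization, applied over the head dimension.
    if norm_type == "L2Normalization":
        # Scale each head vector to unit L2 norm.
        return x / x.norm(dim=-1, keepdim=True).clamp_min(eps)
    if norm_type == "RMSNorm":
        # Root-mean-square scaling, with an optional learnable weight.
        out = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
        return out if weight is None else out * weight
    if norm_type == "LayerNorm":
        # Mean-centering plus variance scaling, with optional affine params.
        return F.layer_norm(x, x.shape[-1:], weight, bias, eps)
    raise ValueError(f"Unsupported norm type: {norm_type}")

q = torch.randn(2, 16, 8, 64)  # [batch, seq, heads, head_dim]
q_l2 = qk_norm(q, "L2Normalization")
q_rms = qk_norm(q, "RMSNorm", weight=torch.ones(64))
```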