Optimize the CUDA Kernel performance of Paddle rms_norm #77098
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##           develop   #77098   +/- ##
===========================================
  Coverage         ?   100.00%
===========================================
  Files            ?         4
  Lines            ?        37
  Branches         ?         0
===========================================
  Hits             ?        37
  Misses           ?         0
  Partials         ?         0
```
Force-pushed from 0604b69 to 65aff3e.
This reverts commit ccaaa1b.
Force-pushed from 65aff3e to 19513e8.
This reverts commit 19513e8.
wanghuancoder
left a comment
LGTM
A-nnonymous
left a comment
LGTM for the kernel in most cases, but it needs polishing afterwards.
```cpp
}
inline __device__ res_t project(acc_t acc) const {
  const auto mean = static_cast<scalar_t>(acc.mean);
  const auto divisor = acc.nf > correction ? acc.nf - correction : 0;
```
In a follow-up, please use explicit type declarations here instead of auto.
done
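For reference, the explicit-type version would look roughly like this (a sketch only; the concrete types, e.g. float for the Welford-count arithmetic, are assumptions, not the merged code):

```cpp
// Explicit types instead of `auto`; the exact types in the merged kernel
// may differ.
const scalar_t mean = static_cast<scalar_t>(acc.mean);
const float divisor = acc.nf > correction ? acc.nf - correction : 0.0f;
```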
```cpp
if (threadIdx.x == 0) {
  T_ACC m1;  // mean
  T_ACC m2;  // var
  thrust::pair<T_ACC, T_ACC> res = welford_op.project(val);
```
In future changes, try to avoid introducing thrust-related data structures and algorithms, to make porting easier.
done
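A sketch of the thrust-free direction suggested above, with a plain aggregate standing in for thrust::pair (WelfordResult is a hypothetical name, not the merged code):

```cpp
// Plain aggregate replacing thrust::pair<T_ACC, T_ACC>, removing the
// thrust dependency from the kernel to ease porting.
template <typename T_ACC>
struct WelfordResult {
  T_ACC mean;  // first element of the former pair
  T_ACC var;   // second element of the former pair
};

// welford_op.project() would then return WelfordResult<T_ACC>:
//   WelfordResult<T_ACC> res = welford_op.project(val);
//   m1 = res.mean;
//   m2 = res.var;
```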
```cpp
int64_t thread_x = static_cast<int64_t>(blockIdx.x) * block_dim_x +
                   static_cast<int64_t>(threadIdx.x);

int lane_id = (threadIdx.y * blockDim.x + threadIdx.x) & (kWarpSize - 1);
```
There is a more efficient way to compute lane_id; keep this overhead in mind when optimizing.
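Two standard cheaper alternatives, sketched here as suggestions rather than the fix actually adopted:

```cpp
// (a) Read the lane id directly from the hardware register via inline PTX,
// avoiding the multiply-add on the thread indices entirely.
__device__ __forceinline__ unsigned LaneId() {
  unsigned lane;
  asm("mov.u32 %0, %%laneid;" : "=r"(lane));
  return lane;
}

// (b) When blockDim.x is a multiple of 32, threadIdx.y * blockDim.x is a
// multiple of the warp size, so the flat-index computation above reduces to
//   int lane_id = threadIdx.x & (kWarpSize - 1);
```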
zrr1999
left a comment
LGTM
python/paddle/nn/functional/norm.py
Outdated
```diff
 def rms_norm(
     input: Tensor,
-    normalized_shape: int | Sequence[int],
+    normalized_shape: list[int],
```
Use Sequence[int] instead.
done
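The resulting annotation would presumably read roughly as follows (a hedged reconstruction; the trailing parameters and defaults are assumptions, not the merged signature):

```python
from __future__ import annotations

from typing import Sequence

from paddle import Tensor

def rms_norm(
    input: Tensor,
    normalized_shape: int | Sequence[int],
    weight: Tensor | None = None,  # assumed: the PR allows an empty weight
    epsilon: float = 1e-5,         # assumed default, matching the yaml diff
) -> Tensor: ...
```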
paddle/phi/ops/yaml/ops.yaml
Outdated
```diff
 - op: rms_norm
-  args: (Tensor x, Tensor scale, float epsilon)
+  args: (Tensor x, Tensor scale, IntArray normalized_shape={}, double epsilon=1e-5)
```
If the Python API pins normalized_shape to int types, int64_t[] can be used here.
done
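The suggested declaration would then read roughly (a sketch, not the merged yaml):

```yaml
- op: rms_norm
  args: (Tensor x, Tensor scale, int64_t[] normalized_shape={}, double epsilon=1e-5)
```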
XiaoguangHu01
left a comment
LGTM
…#77098)

* accuracy and Torch alignment
* support rms_norm behavior to be the same as torch
* fix rms_norm_xpu_kernel
* add valueError_test
* Revert "add valueError_test" (this reverts commit ccaaa1b)
* Reapply "add valueError_test" (this reverts commit 19513e8)
* optimize performance
* add vectorization
* fix
* fix dtype of normalized_shape
PR Category
Operator Mechanism
PR Types
Improvements
Description
Optimize rms_norm:
a. [Done] Accuracy: bitwise-aligned with torch (fp16, bf16, fp32, fp64).
b. [Done] Functionality: Paddle's old rms_norm lacked the normalized_shape parameter and did not support an empty weight; it is now aligned with Torch.
c. [Done] Performance: forward and backward are now roughly on par with Torch.
d. [Done] Performance, stage 2: optimized the API pre-processing and applied wider vectorization, branch, and fusion optimizations inside the kernel (see the sketch after this list); performance now beats torch across the board.
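As an illustration of the vectorized-access idea in (d), a minimal sketch (assuming aligned rows whose length is a multiple of the vector width; illustrative only, not the merged kernel):

```cpp
// Aligned vector wrapper: one load/store instruction moves VecSize elements.
template <typename T, int VecSize>
struct alignas(sizeof(T) * VecSize) VecType {
  T val[VecSize];
};

// Each thread copies VecSize contiguous elements with a single vector load,
// cutting the number of memory transactions by a factor of VecSize.
template <typename T, int VecSize>
__device__ __forceinline__ void LoadVec(const T* src, T* dst) {
  *reinterpret_cast<VecType<T, VecSize>*>(dst) =
      *reinterpret_cast<const VecType<T, VecSize>*>(src);
}
```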
rms_norm forward (before vs. after optimization: on average 2.2x faster than Paddle's old implementation and 6% faster than Torch):


rms_norm backward (before vs. after optimization: on average 1.6x faster than Paddle's old implementation and 31% faster than Torch):