[Optimization] enable trtllm_all_reduce fusion kernel in glm model by BingooYang · Pull Request #6660 · PaddlePaddle/FastDeploy

BingooYang · 2026-03-04T13:10:11Z

Motivation

Integrate the trtllm_allreduce_fusion operator into FastDeploy (FD).

Modifications

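Add flashinfer allreduce fusion operator integration to FD.
Change the GLM-4.5-Air model graph to use the trtllm_allreduce_fusion operator (disabled by default).
Add the command-line flag --enable-flashinfer-allreduce-fusion, which enables trtllm_allreduce_fusion.
Add unit tests for the trtllm_allreduce_fusion operator.
Move the has_flashinfer() function into utils.py.
Upgrade flashinfer to version 0.4.1.2 (Python interface fixes, C++20 compatibility fixes).
Delete the flashinfer cache in tests (leaving it uncleaned on CI machines causes problems).
Change import flashinfer to a lazy import, fixing an issue where a global import combined with paddle compat made model loading go through the torch interface.
Add the --enable-flashinfer-allreduce-fusion setting to some tests.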
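
The lazy-import change listed above can be illustrated with a small sketch. This is not the code from this PR: the helper names `has_module` and `lazy_import` are hypothetical, and the only idea taken from the description is that the availability check and the actual import are deferred until first use instead of happening at module load time.

```python
import importlib
import importlib.util
from functools import lru_cache

# Hypothetical cache of lazily imported modules (name -> module or None).
_module_cache = {}


@lru_cache(maxsize=None)
def has_module(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def lazy_import(name: str):
    """Import `name` on first call only; return None if it is unavailable."""
    if name not in _module_cache:
        module = None
        if has_module(name.split(".")[0]):
            try:
                module = importlib.import_module(name)
            except ImportError:
                module = None  # top-level package exists, submodule does not
        _module_cache[name] = module
    return _module_cache[name]
```

A caller would use something like `comm = lazy_import("flashinfer.comm")` inside the function that needs it, so that simply importing the layer module has no side effects on other frameworks.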

Usage or Command

Local tests pass on both H-series and B-series GPUs.
python -m fastdeploy.entrypoints.openai.api_server --model /root/paddlejob/workspace/bingoo/model/GLM-4.5-Air --tensor-parallel-size 4 --port 8185 --max-num-batched-tokens 2048 --enable-flashinfer-allreduce-fusion
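
As a rough illustration of an opt-in flag that stays off unless passed, here is a minimal argparse sketch. It is not FastDeploy's argument parser: `build_parser` and the `prog` name are made up for the example, and only the flag names are taken from the command above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing an opt-in boolean flag (default: off)."""
    parser = argparse.ArgumentParser(prog="api_server_sketch")
    parser.add_argument("--tensor-parallel-size", type=int, default=1)
    parser.add_argument(
        "--enable-flashinfer-allreduce-fusion",
        action="store_true",  # absent -> False, present -> True
        help="Opt in to the fused all-reduce path.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--tensor-parallel-size", "4", "--enable-flashinfer-allreduce-fusion"]
    )
    print(args.tensor_parallel_size, args.enable_flashinfer_allreduce_fusion)
```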

Accuracy Tests

python -m paddle.distributed.launch --gpus=0,1 ./FastDeploy/tests/layers/test_rms_allreduce_fusion.py
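
For readers unfamiliar with the fused operation, the sketch below gives a pure-Python reference of what an all-reduce + residual-add + RMSNorm fusion is generally understood to compute. It is an assumption-laden illustration, not the test in this PR: the function names are hypothetical, ranks are simulated as a list of vectors, and the order "sum across ranks, add residual, then normalize" is assumed from the operator name `flashinfer_allreduce_residual_rmsnorm`.

```python
import math
from typing import List, Tuple

Vector = List[float]


def allreduce_sum(per_rank: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors, one vector per simulated rank."""
    return [sum(values) for values in zip(*per_rank)]


def rms_norm(x: Vector, weight: Vector, eps: float = 1e-6) -> Vector:
    """RMSNorm: x / sqrt(mean(x^2) + eps), scaled element-wise by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def allreduce_residual_rmsnorm_ref(
    per_rank: List[Vector], residual: Vector, weight: Vector, eps: float = 1e-6
) -> Tuple[Vector, Vector]:
    """Unfused reference: returns (norm_out, residual_out)."""
    reduced = allreduce_sum(per_rank)
    residual_out = [r + x for r, x in zip(residual, reduced)]
    norm_out = rms_norm(residual_out, weight, eps)
    return norm_out, residual_out
```

A fused kernel would be compared against such a reference within a numerical tolerance, since reduction order and reduced-precision arithmetic can change the low-order bits.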

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-03-04T13:10:19Z

Thanks for your contribution!

codecov-commenter · 2026-03-05T07:15:18Z

Codecov Report

❌ Patch coverage is 87.96296% with 13 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@dec0b06). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...oy/model_executor/layers/flashinfer_comm_fusion.py	90.90%	4 Missing and 4 partials ⚠️
fastdeploy/model_executor/layers/normalization.py	40.00%	2 Missing and 1 partial ⚠️
fastdeploy/model_executor/layers/linear.py	60.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #6660   +/-   ##
==========================================
  Coverage           ?   74.19%           
==========================================
  Files              ?      395           
  Lines              ?    54859           
  Branches           ?     8593           
==========================================
  Hits               ?    40700           
  Misses             ?    11416           
  Partials           ?     2743

Flag	Coverage Δ
GPU	`74.18% <87.96%> (?)`


BingooYang · 2026-03-06T08:33:39Z

/re-run all-failed

BingooYang · 2026-03-06T11:04:00Z

/re-run all-failed

BingooYang · 2026-03-10T11:18:34Z

/re-run all-failed

BingooYang · 2026-03-10T12:36:21Z

/re-run all-failed

BingooYang · 2026-03-10T13:09:15Z

/re-run all-failed

BingooYang · 2026-03-11T10:55:19Z

/re-run all-failed

PaddlePaddle-bot

🤖 AI Code Review | 2026-04-15

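📋 Review Summary

PR overview: enable the trtllm_allreduce_fusion operator in the GLM model to improve performance

Scope of changes: model_executor/layers/, model_executor/models/glm4_moe.py, config/, engine/

Affected tags: [OP] [Models] [Optimization]

Issues found

Severity	File	Summary
🔴 Bug	`normalization.py:253`	No fallback when fusion fails, which leads to a runtime crash
🟡 Suggestion	`flashinfer_comm_fusion.py:206`	The cleanup function is never called, which may leak resources

Overall assessment

The PR integrates the flashinfer allreduce fusion operator; the code structure is clear and test coverage is fairly thorough. However, there is no fallback when fusion fails, which could crash the runtime in production. Fixing this before merging is recommended.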

PaddlePaddle-bot · 2026-04-15T11:55:26Z

+                norm_out = flashinfer_allreduce_residual_rmsnorm(
+                    fd_config=self.fd_config, input_tensor=x, residual=residual_input, weight=self.weight, eps=self.eps
+                )
+                assert norm_out[0] is not None, "Trtllm-all-reduce fusion failed!"


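🔴 Bug: no fallback when fusion fails

When flashinfer is unavailable, workspace initialization fails, or a similar condition occurs, flashinfer_allreduce_residual_rmsnorm returns (None, None), but the code here then raises an assertion error. This can crash the runtime in production.

Suggested change for graceful degradation: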

elif self.enable_all_reduce_fusion and x.shape[0] <= 2048:
    norm_out, residual_out = flashinfer_allreduce_residual_rmsnorm(
        fd_config=self.fd_config, input_tensor=x, residual=residual_input, weight=self.weight, eps=self.eps
    )
    if norm_out is None or residual_out is None:
        # Fallback to standard all-reduce + RMSNorm
        if is_batch_invariant_mode_enabled():
            if residual_input is not None:
                x = x + residual_input
            norm_out = rms_norm_batch_invariant(x, self.weight, self.eps), x
        else:
            norm_out = self.norm_func(
                x,
                norm_weight=self.weight,
                norm_bias=None,
                epsilon=self.eps,
                begin_norm_axis=self.begin_norm_axis,
                bias=self.bias,
                residual=residual_input,
                quant_scale=(-1 if self.quant_scale is None else self.quant_scale),
                quant_round_type=self.quant_round_type,
                quant_max_bound=self.quant_max_bound,
                quant_min_bound=self.quant_min_bound,
            )
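
The fallback idea in this comment can be reduced to a framework-free sketch. Everything here is hypothetical: `norm_with_fallback`, `fused`, and `unfused` are stand-in names, tensors are modelled as lists of token rows, and the only details taken from the review are the `(None, None)` failure signal and the 2048-token threshold.

```python
from typing import Callable, List, Optional, Tuple

Rows = List[List[float]]  # one inner list per token
FusedFn = Callable[[Rows, Rows], Tuple[Optional[Rows], Optional[Rows]]]
UnfusedFn = Callable[[Rows, Rows], Tuple[Rows, Rows]]


def norm_with_fallback(
    x: Rows,
    residual: Rows,
    fused: FusedFn,
    unfused: UnfusedFn,
    max_tokens: int = 2048,
) -> Tuple[Rows, Rows, str]:
    """Try the fused path for small batches; degrade instead of asserting."""
    if len(x) <= max_tokens:
        norm_out, residual_out = fused(x, residual)
        if norm_out is not None and residual_out is not None:
            return norm_out, residual_out, "fused"
    # The fused op would have done the all-reduce itself, so the unfused
    # callable is expected to perform all-reduce and RMSNorm separately.
    norm_out, residual_out = unfused(x, residual)
    return norm_out, residual_out, "unfused"
```

Returning the path taken as a third value is only a convenience for testing and logging in this sketch; a real layer would more likely log once and keep the original return shape.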

PaddlePaddle-bot · 2026-04-15T11:55:26Z

+    return norm_out, residual_out
+
+
+def cleanup_flashinfer_workspace():


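🟡 Suggestion: the cleanup function is never called

cleanup_flashinfer_workspace is defined but never called in production code, which may leak IPC workspace resources.

Consider calling it in the following cases:

When the worker process shuts down

When the model is unloaded

When a configuration change requires re-initialization

For reference, the cleanup call could be added in fastdeploy/worker/worker_process.py or in the engine shutdown logic.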
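
One common way to make sure such a cleanup runs at process exit is an idempotent release method registered with `atexit`. The sketch below is a generic pattern under stated assumptions, not FastDeploy code: the `Workspace` class and its placeholder handle are hypothetical stand-ins for whatever the real workspace holds.

```python
import atexit
import threading


class Workspace:
    """Hypothetical stand-in for a process-wide communication workspace."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._handle = None
        self.release_count = 0  # how many times a real release happened

    def acquire(self):
        """Create the (placeholder) handle on first use and return it."""
        with self._lock:
            if self._handle is None:
                self._handle = object()  # placeholder for real buffers
            return self._handle

    def release(self) -> None:
        """Idempotent: safe to call from shutdown hooks and explicit teardown."""
        with self._lock:
            if self._handle is not None:
                self._handle = None
                self.release_count += 1


workspace = Workspace()
atexit.register(workspace.release)  # runs on normal interpreter exit
```

Note that `atexit` handlers do not run when the process is killed by an unhandled signal or exits through `os._exit`, so an explicit call in the worker shutdown path is still worth having.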

BingooYang · 2026-04-16T03:18:25Z

/re-run all-failed

…addlePaddle#6660)
* enable trtllm_all_reduce fusion kernel in glm model
* fix conflict
* format update
* fix a bug
* modify test
* modify test
* support empty tensor and modify test
* fix test_linear config issues
* modify test name
* add edge test case
* modify format
* fix conflict
* modify default max token num in trtllm_allreduce_fusion
* add max token num branch for trtllm_allreduce_fusion
* fix format
* fix rmsnorm config issue
* modify 2025 to 2026
* using compat grard
* Lazily import flashinfer.comm and fix test config issue
* fix test issues
* add flashinfer cache dir clean machine
* fix some issues

… glm model (#6660) (#7228) * enable trtllm_all_reduce fusion kernel in glm model * update flashinfer paddle version * format update modify test modify test support empty tensor and modify test fix test_linear config issues modify test name add edge test case modify format fix conflict modify default max token num in trtllm_allreduce_fusion add max token num branch for trtllm_allreduce_fusion fix format fix rmsnorm config issue modify 2025 to 2026 enable trtllm_allreduce fusion Revert "[Cherry-Pick][CI] Use GPU-Build-RL runner for _build_linux_rl.yml (#7186) (#7195)" This reverts commit ca2f38b. Revert "[Cherry-Pick][BugFix] prevent requests from entering running state without a slot(#7141) (#7181)" This reverts commit 80f4a72. clean flashinfer cache and modify test fix dumpy patch issue fix some issues * remove redundent * enable moe reduce fusion * fix test * fix cuda context issue * update flashinfer version

BingooYang temporarily deployed to Metax_ci March 4, 2026 13:10 — with GitHub Actions Inactive

BingooYang had a problem deploying to Metax_ci March 5, 2026 03:23 — with GitHub Actions Error

BingooYang changed the title ~~enable trtllm_all_reduce fusion kernel in glm model~~ [Optimization] enable trtllm_all_reduce fusion kernel in glm model Mar 5, 2026

BingooYang temporarily deployed to Metax_ci March 5, 2026 03:26 — with GitHub Actions Inactive

BingooYang temporarily deployed to Metax_ci March 5, 2026 05:33 — with GitHub Actions Inactive

BingooYang temporarily deployed to Metax_ci March 5, 2026 08:32 — with GitHub Actions Inactive

BingooYang temporarily deployed to Metax_ci March 5, 2026 14:03 — with GitHub Actions Inactive

BingooYang temporarily deployed to Metax_ci March 6, 2026 02:16 — with GitHub Actions Inactive

BingooYang temporarily deployed to Metax_ci March 6, 2026 06:09 — with GitHub Actions Inactive

BingooYang had a problem deploying to Metax_ci March 10, 2026 02:33 — with GitHub Actions Error

BingooYang had a problem deploying to Metax_ci March 10, 2026 03:14 — with GitHub Actions Failure

BingooYang had a problem deploying to Metax_ci March 10, 2026 05:53 — with GitHub Actions Error

BingooYang force-pushed the trtllm_allreduce branch from e0fd641 to b314228 Compare March 10, 2026 06:09

BingooYang temporarily deployed to Metax_ci March 10, 2026 06:09 — with GitHub Actions Inactive

BingooYang had a problem deploying to Metax_ci March 11, 2026 07:11 — with GitHub Actions Failure

BingooYang had a problem deploying to Metax_ci March 11, 2026 10:55 — with GitHub Actions Failure

BingooYang force-pushed the trtllm_allreduce branch from 08d2f16 to 09cb26d Compare March 11, 2026 12:43

BingooYang had a problem deploying to Metax_ci March 11, 2026 12:43 — with GitHub Actions Failure

BingooYang had a problem deploying to Metax_ci March 11, 2026 13:38 — with GitHub Actions Failure

BingooYang force-pushed the trtllm_allreduce branch from b7c4a47 to be78caa Compare March 24, 2026 03:26

BingooYang temporarily deployed to Metax_ci March 24, 2026 03:27 — with GitHub Actions Inactive

BingooYang temporarily deployed to Metax_ci March 26, 2026 07:36 — with GitHub Actions Inactive

BingooYang added 18 commits April 15, 2026 19:36

modify test

ed77444

modify test

5612a65

support empty tensor and modify test

aae0b1d

fix test_linear config issues

c0790e1

modify test name

b2be3f9

add edge test case

df0c96e

modify format

bf7df5c

fix conflict

dc0499d

modify default max token num in trtllm_allreduce_fusion

155c363

add max token num branch for trtllm_allreduce_fusion

9df022b

fix format

25e6615

fix rmsnorm config issue

4b462ea

modify 2025 to 2026

23c5838

using compat grard

335527c

Lazily import flashinfer.comm and fix test config issue

4edd889

fix test issues

11a1cab

add flashinfer cache dir clean machine

771e5ad

fix some issues

912aab4

BingooYang dismissed stale reviews from qingqing01 and zoooo0820 via 912aab4 April 15, 2026 11:37

BingooYang force-pushed the trtllm_allreduce branch from 93fad2c to 912aab4 Compare April 15, 2026 11:37

BingooYang had a problem deploying to Metax_ci April 15, 2026 11:37 — with GitHub Actions Failure

PaddlePaddle-bot reviewed Apr 15, 2026

BingooYang had a problem deploying to Metax_ci April 16, 2026 03:18 — with GitHub Actions Failure

qingqing01 approved these changes Apr 16, 2026

zoooo0820 merged commit 6b891da into PaddlePaddle:develop Apr 16, 2026
54 of 59 checks passed

PaddlePaddle-bot mentioned this pull request Apr 28, 2026

[Cherry-Pick][Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660) #7228

Merged

5 tasks

zoooo0820 merged 22 commits into PaddlePaddle:develop from BingooYang:trtllm_allreduce


5 participants
