[Optimization] enable trtllm_all_reduce fusion kernel in glm model#6660

Merged
zoooo0820 merged 22 commits into PaddlePaddle:develop from BingooYang:trtllm_allreduce on Apr 16, 2026

BingooYang (Contributor) commented Mar 4, 2026

Motivation

Integrate the trtllm_allreduce_fusion operator into FD (FastDeploy).

Modifications

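  1. Add flashinfer allreduce fusion operator support to FD.
  2. Change the GLM-4.5-Air model network structure to use the trtllm_allreduce_fusion operator (disabled by default).
  3. Add the command-line argument --enable-flashinfer-allreduce-fusion, which enables trtllm_allreduce_fusion.
  4. Add unit tests for the trtllm_allreduce_fusion operator.
  5. Move the has_flashinfer() function into utils.py.
  6. Upgrade flashinfer to version 0.4.1.2 (Python interface fixes, C++20 compatibility fixes).
  7. Add flashinfer cache deletion to the tests (without this cleanup, problems occur on the CI machines).
  8. Change import flashinfer to a lazy import, fixing an issue where a global import coexisting with paddle compat caused model loading to go through the torch interface.
  9. Add the --enable-flashinfer-allreduce-fusion setting to some tests.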
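Items 5 and 8 combine an availability check with a deferred import. A minimal sketch of that pattern using only the standard library follows. The helper names `has_module` and `lazy_import` are hypothetical and are not FastDeploy's actual `has_flashinfer()` implementation.

```python
import importlib
import importlib.util
from functools import lru_cache


@lru_cache(maxsize=None)
def has_module(name: str) -> bool:
    """Return True if `name` looks importable, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package or a broken spec counts as "not available".
        return False


def lazy_import(name: str):
    """Import `name` at call time instead of at module load time."""
    if not has_module(name):
        return None
    return importlib.import_module(name)
```

Calling `lazy_import(...)` inside the function that needs the package keeps the import out of module scope, so merely loading the module has no import side effects.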

Usage or Command

Local tests pass on both H-series and B-series GPUs.
python -m fastdeploy.entrypoints.openai.api_server --model /root/paddlejob/workspace/bingoo/model/GLM-4.5-Air --tensor-parallel-size 4 --port 8185 --max-num-batched-tokens 2048 --enable-flashinfer-allreduce-fusion
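The fusion path stays off unless the flag is passed. A minimal sketch of an opt-in boolean flag with `argparse` follows. Only the two flag names come from this PR; the parser, its `prog` name, and the default of 1 for `--tensor-parallel-size` are assumptions for illustration, not FastDeploy's real argument handling.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser showing an opt-in switch that defaults to off."""
    parser = argparse.ArgumentParser(prog="api_server_sketch")
    # Default of 1 is an assumption made for this sketch.
    parser.add_argument("--tensor-parallel-size", type=int, default=1)
    parser.add_argument(
        "--enable-flashinfer-allreduce-fusion",
        action="store_true",  # absent -> False, present -> True
        help="Opt in to the fused all-reduce + RMSNorm kernel.",
    )
    return parser
```

With `store_true`, existing launch commands keep their behavior and only commands that add the flag take the new path.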

Accuracy Tests

python -m paddle.distributed.launch --gpus=0,1 ./FastDeploy/tests/layers/test_rms_allreduce_fusion.py
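An accuracy test for a fused kernel usually compares it against an unfused reference. The sketch below is a pure-Python reference for all-reduce (sum) followed by residual add and RMSNorm, with ranks simulated as a list. It assumes the fused op returns `(norm_out, residual_out)` where `residual_out` is the pre-norm sum; the function name and the default `eps` are hypothetical and not taken from the test file.

```python
import math
from typing import List, Sequence, Tuple

Row = List[float]


def reference_allreduce_residual_rmsnorm(
    partials: Sequence[Sequence[Sequence[float]]],  # indexed [rank][token][hidden]
    residual: Sequence[Sequence[float]],            # indexed [token][hidden]
    weight: Sequence[float],                        # indexed [hidden]
    eps: float = 1e-6,
) -> Tuple[List[Row], List[Row]]:
    norm_out: List[Row] = []
    residual_out: List[Row] = []
    for t, res_row in enumerate(residual):
        # All-reduce (sum) across ranks, then add the residual.
        summed = [
            sum(rank[t][h] for rank in partials) + res_row[h]
            for h in range(len(weight))
        ]
        # RMSNorm over the hidden dimension, scaled by the learned weight.
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in summed) / len(summed) + eps)
        norm_out.append([v * inv_rms * w for v, w in zip(summed, weight)])
        residual_out.append(summed)
    return norm_out, residual_out
```

A real test would run the fused kernel on each rank and compare its outputs to this kind of reference within a dtype-appropriate tolerance.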

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot Bot commented Mar 4, 2026

Thanks for your contribution!

BingooYang changed the title from "enable trtllm_all_reduce fusion kernel in glm model" to "[Optimization] enable trtllm_all_reduce fusion kernel in glm model" on Mar 5, 2026
codecov-commenter commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 87.96296% with 13 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@dec0b06). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...oy/model_executor/layers/flashinfer_comm_fusion.py | 90.90% | 4 Missing and 4 partials ⚠️ |
| fastdeploy/model_executor/layers/normalization.py | 40.00% | 2 Missing and 1 partial ⚠️ |
| fastdeploy/model_executor/layers/linear.py | 60.00% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6660   +/-   ##
==========================================
  Coverage           ?   74.19%           
==========================================
  Files              ?      395           
  Lines              ?    54859           
  Branches           ?     8593           
==========================================
  Hits               ?    40700           
  Misses             ?    11416           
  Partials           ?     2743           
Flag Coverage Δ
GPU 74.18% <87.96%> (?)
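The figures in the report can be cross-checked with a few lines of arithmetic. In the sketch below, the project numbers are copied from the coverage diff; the patch size of 108 lines is inferred from "87.96296% with 13 lines missing" and is not stated in the report.

```python
# Project totals from the coverage diff above.
hits, misses, partials = 40700, 11416, 2743
lines = hits + misses + partials
project_coverage = round(100 * hits / lines, 2)

# Patch coverage: 13 lines are reported as missing or partial.
missing_in_patch = 13
patch_lines = 108  # inferred, not stated in the report
patch_coverage = 100 * (patch_lines - missing_in_patch) / patch_lines
```

The 13 uncovered patch lines also match the per-file rows: 8 in flashinfer_comm_fusion.py, 3 in normalization.py, and 2 in linear.py.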

BingooYang (Contributor Author) commented

/re-run all-failed

@PaddlePaddle-bot PaddlePaddle-bot left a comment

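🤖 AI Code Review | 2026-04-15

📋 Review Summary

PR overview: Enables the trtllm_allreduce_fusion operator in the GLM model to improve performance.

Scope of changes: model_executor/layers/, model_executor/models/glm4_moe.py, config/, engine/

Impact tags: [OP] [Models] [Optimization]

Issues found

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | normalization.py:253 | No fallback when fusion fails, which causes a runtime crash |
| 🟡 Suggestion | flashinfer_comm_fusion.py:206 | The cleanup function is never called, which may leak resources |

Overall assessment

The PR integrates the flashinfer allreduce fusion operator. The code structure is clear and test coverage is fairly comprehensive. However, there is no fallback when fusion fails, which may cause runtime crashes in production. Fixing this before merging is recommended.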

norm_out = flashinfer_allreduce_residual_rmsnorm(
    fd_config=self.fd_config, input_tensor=x, residual=residual_input, weight=self.weight, eps=self.eps
)
assert norm_out[0] is not None, "Trtllm-all-reduce fusion failed!"
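🔴 Bug: no fallback when fusion fails

When flashinfer is unavailable, workspace initialization fails, or a similar condition occurs, flashinfer_allreduce_residual_rmsnorm returns (None, None), and the code here then raises an assertion error. This may cause runtime crashes in production.

Suggested change for graceful degradation: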

elif self.enable_all_reduce_fusion and x.shape[0] <= 2048:
    # The fused op returns a (norm_out, residual_out) tuple, or (None, None) on failure.
    norm_out = flashinfer_allreduce_residual_rmsnorm(
        fd_config=self.fd_config, input_tensor=x, residual=residual_input, weight=self.weight, eps=self.eps
    )
    if norm_out[0] is None:
        # Fallback to standard all-reduce + RMSNorm; norm_out stays a tuple on every branch.
        # NOTE: only the RMSNorm half is shown; this snippet contains no all-reduce call on x.
        if is_batch_invariant_mode_enabled():
            if residual_input is not None:
                x = x + residual_input
            norm_out = rms_norm_batch_invariant(x, self.weight, self.eps), x
        else:
            norm_out = self.norm_func(
                x,
                norm_weight=self.weight,
                norm_bias=None,
                epsilon=self.eps,
                begin_norm_axis=self.begin_norm_axis,
                bias=self.bias,
                residual=residual_input,
                quant_scale=(-1 if self.quant_scale is None else self.quant_scale),
                quant_round_type=self.quant_round_type,
                quant_max_bound=self.quant_max_bound,
                quant_min_bound=self.quant_min_bound,
            )
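The graceful-degradation idea in this suggestion can be isolated from the layer code. The sketch below is framework-free: a fused callable signals failure by returning `(None, None)`, and the caller then switches to an unfused callable. All names here (`run_with_fallback`, `fused_op`, `unfused_op`) are hypothetical stand-ins, not FastDeploy APIs.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
FusedResult = Tuple[Optional[Vector], Optional[Vector]]


def run_with_fallback(
    x: Vector,
    fused_op: Callable[[Vector], FusedResult],
    unfused_op: Callable[[Vector], Tuple[Vector, Vector]],
) -> Tuple[Vector, Vector]:
    """Try the fused kernel; use the unfused path if it reports failure."""
    norm_out, residual_out = fused_op(x)
    if norm_out is None or residual_out is None:
        # The fused path signalled "unavailable" via (None, None).
        return unfused_op(x)
    return norm_out, residual_out
```

Compared with an `assert`, this keeps serving requests when the fused kernel cannot be used, at the cost of silently running the slower path, so a real implementation would likely log the fallback once.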

return norm_out, residual_out


def cleanup_flashinfer_workspace():
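🟡 Suggestion: the cleanup function is never called

The cleanup_flashinfer_workspace function is defined but never called in production code, which may leak IPC workspace resources.

Consider calling it in the following cases:

  1. When a worker process shuts down
  2. When the model is unloaded
  3. When a configuration change requires re-initialization

For reference, the cleanup call could be added in fastdeploy/worker/worker_process.py or in the engine shutdown logic.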
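One way to cover the shutdown case is to register the cleanup with `atexit` and guard it so it runs at most once. The sketch below takes any zero-argument callable, standing in for `cleanup_flashinfer_workspace`. The wrapper name `register_workspace_cleanup` is hypothetical, and whether `atexit` is the right hook for FastDeploy's worker processes is an assumption: `atexit` handlers do not run when a process exits through `os._exit` or is killed by an unhandled signal.

```python
import atexit
from typing import Callable


def register_workspace_cleanup(cleanup: Callable[[], None]) -> Callable[[], None]:
    """Run `cleanup` at most once: on an explicit call or at interpreter exit."""
    done = False

    def run_once() -> None:
        nonlocal done
        if done:
            return
        done = True
        cleanup()

    # Covers normal interpreter shutdown; explicit callers can still invoke
    # the returned function earlier (for example on model unload).
    atexit.register(run_once)
    return run_once
```

Returning the guarded function lets the unload and re-initialization paths trigger the same cleanup without risking a double free at exit.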

@zoooo0820 zoooo0820 merged commit 6b891da into PaddlePaddle:develop Apr 16, 2026
54 of 59 checks passed
xiaoguoguo626807 pushed a commit to xiaoguoguo626807/FastDeploy that referenced this pull request May 7, 2026
…addlePaddle#6660)

* enable trtllm_all_reduce fusion kernel in glm model

* fix conflict

* format update

* fix a bug

* modify test

* modify test

* support empty tensor and modify test

* fix test_linear config issues

* modify test name

* add edge test case

* modify format

* fix conflict

* modify default max token num in trtllm_allreduce_fusion

* add max token num branch for trtllm_allreduce_fusion

* fix format

* fix rmsnorm config issue

* modify 2025 to 2026

* using compat grard

* Lazily import flashinfer.comm and fix test config issue

* fix test issues

* add flashinfer cache dir clean machine

* fix some issues
K11OntheBoat pushed a commit that referenced this pull request May 12, 2026
… glm model (#6660) (#7228)

* enable trtllm_all_reduce fusion kernel in glm model

* update flashinfer paddle version

* format update

modify test

modify test

support empty tensor and modify test

fix test_linear config issues

modify test name

add edge test case

modify format

fix conflict

modify default max token num in trtllm_allreduce_fusion

add max token num branch for trtllm_allreduce_fusion

fix format

fix rmsnorm config issue

modify 2025 to 2026

enable trtllm_allreduce fusion

Revert "[Cherry-Pick][CI] Use GPU-Build-RL runner for _build_linux_rl.yml (#7186) (#7195)"

This reverts commit ca2f38b.

Revert "[Cherry-Pick][BugFix] prevent requests from entering running state without a slot(#7141) (#7181)"

This reverts commit 80f4a72.

clean flashinfer cache and modify test

fix dumpy patch issue

fix some issues

* remove redundent

* enable moe reduce fusion

* fix test

* fix cuda context issue

* update flashinfer version