
[Cherry-pick][Optimization] enable trtllm_all_reduce fusion kernel in glm model#7219

Open
BingooYang wants to merge 18 commits into PaddlePaddle:release/2.5 from BingooYang:2.5/trtllm_allreduce

Conversation

@BingooYang
Contributor

Motivation

Integrate the trtllm_allreduce_fusion operator into FastDeploy (FD).

Modifications

  1. Add integration of the flashinfer allreduce fusion operator to FD
  2. Modify the GLM-Air-4.5 model network structure to hook in the trtllm_allreduce_fusion operator (disabled by default)
  3. Add the command-line flag --enable-flashinfer-allreduce-fusion, which enables trtllm_allreduce_fusion
  4. Add a unit test for the trtllm_allreduce_fusion operator
  5. Move the def has_flashinfer() function into utils.py
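As a rough illustration of item 5, a capability probe like has_flashinfer() is typically a small import check; the body below is a hypothetical sketch, not FastDeploy's actual implementation:

```python
import importlib.util

def has_flashinfer() -> bool:
    """Return True if the flashinfer package is importable on this host."""
    # find_spec probes the import machinery without actually importing
    # the package, so this is cheap and side-effect free.
    return importlib.util.find_spec("flashinfer") is not None
```

Centralizing such a check in utils.py lets every caller (operator registration, config validation) share one consistent answer.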

Usage or Command

Local tests pass on both H-series and B-series GPUs.
python -m fastdeploy.entrypoints.openai.api_server --model /root/paddlejob/workspace/bingoo/model/GLM-4.5-Air --tensor-parallel-size 4 --port 8185 --max-num-batched-tokens 2048 --enable-flashinfer-allreduce-fusion

Accuracy Tests

python -m paddle.distributed.launch --gpus=0,1 ./FastDeploy/tests/layers/test_rms_allreduce_fusion.py
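The accuracy test above validates the fused kernel against an unfused path. For intuition, the reference computation (residual add followed by RMSNorm, without the communication step) can be sketched in plain NumPy; the function name and shapes here are illustrative, not the actual test code:

```python
import numpy as np

def residual_rmsnorm_reference(x, residual, weight, eps=1e-6):
    """Single-rank reference for the fused allreduce+residual+RMSNorm path."""
    # The fused kernel adds the residual before normalizing.
    hidden = x + residual
    # RMSNorm: scale each row by the reciprocal root-mean-square of its elements.
    variance = np.mean(hidden.astype(np.float32) ** 2, axis=-1, keepdims=True)
    normed = hidden * (1.0 / np.sqrt(variance + eps))
    # Return both the normalized output and the updated residual stream.
    return normed * weight, hidden

x = np.random.randn(4, 8).astype(np.float32)
res = np.random.randn(4, 8).astype(np.float32)
w = np.ones(8, dtype=np.float32)
out, new_res = residual_rmsnorm_reference(x, res, w)
```

In the multi-rank case the allreduce would run on x before the residual add; comparing the fused kernel's two outputs against a reference like this is the usual correctness check.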

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please state the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets a release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Apr 7, 2026

Thanks for your contribution!


@fastdeploy-bot fastdeploy-bot left a comment


📋 Review Summary

PR overview: enables the trtllm_all_reduce fusion kernel for the GLM-Air-4.5 model, adding a flashinfer fused operator to improve performance.

Scope of changes: model_executor/layers/, model_executor/models/glm4_moe.py, config/, engine/

Impact tags: [Optimization] [Models] [OP]

📝 PR Compliance Check

The PR meets the guidelines:

  • The title contains a valid tag: [Optimization]
  • Motivation and Modifications are described clearly
  • Checklist items are complete
  • Usage and Accuracy Tests are provided

问题

级别 文件 概述
🔴 Bug normalization.py:249-252 fusion 返回 None 时会导致 AttributeError
🟡 建议 normalization.py:249, linear.py:941, flashinfer_comm_fusion.py:87,118 max_token_num 硬编码 2048,建议从配置读取

Overall Assessment

The PR integrates the trtllm_all_reduce fusion kernel, and the tests cover the main paths. However, there is one critical bug: when flashinfer is unavailable, the fusion function returns (None, None), and the caller does not handle this case, causing a runtime error. Recommend merging after the fix.

# enable trtllm all reduce fusion
elif self.enable_all_reduce_fusion and x.shape[0] <= 2048:
    norm_out = flashinfer_allreduce_residual_rmsnorm(
        fd_config=self.fd_config, input_tensor=x, residual=residual_input, weight=self.weight, eps=self.eps
    )


🔴 Bug: When flashinfer_allreduce_residual_rmsnorm returns (None, None) (flashinfer unavailable or the workspace not initialized), the code goes on to call norm_out[0].astype(), raising AttributeError: 'NoneType' object has no attribute 'astype'.

Suggested fix:

# enable trtllm all reduce fusion
elif self.enable_all_reduce_fusion and x.shape[0] <= 2048:
    norm_out = flashinfer_allreduce_residual_rmsnorm(
        fd_config=self.fd_config, input_tensor=x, residual=residual_input, weight=self.weight, eps=self.eps
    )
    # Check if fusion succeeded, fallback to normal path if not
    if norm_out[0] is None or norm_out[1] is None:
        norm_out = self.norm_func(
            x,
            norm_weight=self.weight,
            norm_bias=None,
            epsilon=self.eps,
            begin_norm_axis=self.begin_norm_axis,
            bias=self.bias,
            residual=residual_input,
            quant_scale=(-1 if self.quant_scale is None else self.quant_scale),
            quant_round_type=self.quant_round_type,
            quant_max_bound=self.quant_max_bound,
            quant_min_bound=self.quant_min_bound,
        )

residual: paddle.Tensor,
weight: paddle.Tensor,
eps: float = 1e-6,
max_token_num: int = 2048,


🟡 Suggestion: max_token_num is hard-coded as 2048 in several places, which limits configurability. Read this parameter from FDConfig instead.

Affected locations:

  1. linear.py:941 - out.shape[0] <= 2048
  2. normalization.py:249 - x.shape[0] <= 2048
  3. flashinfer_comm_fusion.py:87 - max_token_num: int = 2048 (default argument)
  4. flashinfer_comm_fusion.py:118 - max_token_num: int = 2048 (default argument)

Adding a flashinfer_allreduce_max_token_num field to FDConfig would unify the configuration.
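A minimal sketch of this suggestion, assuming a simplified stand-in for FDConfig (the field name flashinfer_allreduce_max_token_num is the reviewer's proposal; the helper function is illustrative):

```python
from dataclasses import dataclass

@dataclass
class FDConfig:
    # Reviewer-suggested field replacing the hard-coded 2048 literal.
    flashinfer_allreduce_max_token_num: int = 2048

def should_use_allreduce_fusion(fd_config: FDConfig, num_tokens: int,
                                fusion_enabled: bool) -> bool:
    """Gate the fused path on the configured token budget, not a literal."""
    return fusion_enabled and num_tokens <= fd_config.flashinfer_allreduce_max_token_num

cfg = FDConfig(flashinfer_allreduce_max_token_num=4096)
```

With this shape, linear.py, normalization.py, and flashinfer_comm_fusion.py would all consult the same config field instead of repeating 2048 in four places.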

