[Cherry-Pick] Fix moe topk select bug in cudagraph(#7069)#7070

Merged
zoooo0820 merged 1 commit into PaddlePaddle:release/2.5 from zhangbo9674:cp/fix_top_bug
Mar 30, 2026

@zhangbo9674
Contributor

Motivation

[Cherry-Pick] #7069

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings March 30, 2026 02:50
@paddle-bot

paddle-bot bot commented Mar 30, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment

Pull request overview

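This PR cherry-picks #7069 from the develop branch. It fixes a compatibility problem in the MoE topk/group selection logic under CUDA Graph (cudagraph) with certain implementations, so that routing selection no longer goes wrong during cudagraph capture/replay.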

Changes:

  • Build the group mask with one_hot + sum instead of the put_along_axis path.
  • Use paddle.index_sample instead of paddle.take_along_axis to fetch the topk weights.
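
To illustrate the first change, here is a minimal sketch of why a one_hot + sum mask matches a scatter-style mask. It uses NumPy as a stand-in for the Paddle ops, and the helper names `group_mask_scatter` and `group_mask_one_hot` are hypothetical, not taken from the PR. The equivalence assumes the selected group indices in each row are distinct, which holds for indices returned by a top-k.

```python
import numpy as np


def group_mask_scatter(group_idx, n_group):
    """Scatter-style reference: write 1 at each selected group position."""
    mask = np.zeros((group_idx.shape[0], n_group), dtype=np.float32)
    np.put_along_axis(mask, group_idx, 1.0, axis=1)
    return mask


def group_mask_one_hot(group_idx, n_group):
    """One-hot every selected index, then sum over the topk_group axis."""
    # Indexing an identity matrix yields shape [seq_len, topk_group, n_group].
    one_hot = np.eye(n_group, dtype=np.float32)[group_idx]
    # Summing over axis 1 collapses it to [seq_len, n_group].
    return one_hot.sum(axis=1)


# Two tokens, each keeping 2 of 4 groups (illustrative values).
group_idx = np.array([[0, 2], [3, 1]])
mask = group_mask_one_hot(group_idx, n_group=4)
assert np.array_equal(mask, group_mask_scatter(group_idx, n_group=4))
```

The one_hot form has no indexed write into a preallocated tensor. The PR does not state why the put_along_axis path misbehaves under capture, so this sketch only demonstrates that the two forms produce the same mask.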

Comment on lines +188 to +190
group_mask = paddle.sum(
paddle.nn.functional.one_hot(group_idx, num_classes=n_group).cast(group_scores.dtype),
axis=1, # Sum over topk_group dimension -> [seq_len, n_group]

Copilot AI Mar 30, 2026

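The current PR title does not follow the repository's Cherry-Pick convention: it needs at least one tag (such as [BugFix]) after [Cherry-Pick], and the original develop PR number appended at the end of the title (here, (#7069)). Otherwise the CI Cherry-Pick check may fail. Please adjust the title to match the template.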

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +189 to 191
paddle.nn.functional.one_hot(group_idx, num_classes=n_group).cast(group_scores.dtype),
axis=1, # Sum over topk_group dimension -> [seq_len, n_group]
)

Copilot AI Mar 30, 2026

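Key fields in the PR description (Modifications / Usage or Command / Accuracy Tests) are empty. Since this change fixes a routing selection bug under cudagraph, please add: how to reproduce it and the scope of impact, how the fix works, and at least one runnable verification command or accuracy/regression result, to make risk assessment for the release branch easier.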

Copilot generated this review using guidance from repository custom instructions.
Comment on lines 197 to +198
  _, topk_ids = paddle.topk(probs_for_choice, top_k, axis=-1)
- topk_weights = paddle.take_along_axis(gate_probs, topk_ids, axis=-1)
+ topk_weights = paddle.index_sample(gate_probs, topk_ids)
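
The swap from take_along_axis to index_sample in the lines above keeps the result unchanged for 2-D inputs. The sketch below shows this in NumPy as a stand-in for Paddle; the local `index_sample` function is a hypothetical re-implementation of the documented row-wise gather semantics (`out[i][j] = x[i][index[i][j]]`), and the sample values are made up for illustration.

```python
import numpy as np


def index_sample(x, index):
    """Row-wise gather on 2-D inputs: out[i, j] = x[i, index[i, j]]."""
    rows = np.arange(x.shape[0])[:, None]  # shape [batch, 1], broadcasts over k
    return x[rows, index]


# Two tokens, three experts (illustrative probabilities).
gate_probs = np.array([[0.1, 0.6, 0.3],
                       [0.5, 0.2, 0.3]])
# Top-2 expert ids per token, highest probability first.
topk_ids = np.argsort(-gate_probs, axis=-1)[:, :2]

topk_weights = index_sample(gate_probs, topk_ids)
# For 2-D inputs this matches a gather along the last axis.
assert np.array_equal(
    topk_weights, np.take_along_axis(gate_probs, topk_ids, axis=-1)
)
```

This only covers the 2-D case used in the snippet, where `gate_probs` and `topk_ids` share the same leading dimension.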

Copilot AI Mar 30, 2026

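This change addresses the topk/group mask selection problem under cudagraph, but the current unit tests (for example tests/operators/test_noaux_tc_redundant.py) cover only numerical correctness, not CUDA Graph capture/replay. Consider adding or extending a unit test that runs moe_topk_select inside paddle.device.cuda.graphs.CUDAGraph capture/replay (including the n_group>1 && topk_group<n_group branch) to keep this class of regression from recurring.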

Copilot generated this review using guidance from repository custom instructions.
Collaborator

@zoooo0820 zoooo0820 left a comment

LGTM

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.5@f4caa18). Learn more about missing BASE report.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.5    #7070   +/-   ##
==============================================
  Coverage               ?   68.97%           
==============================================
  Files                  ?      390           
  Lines                  ?    54086           
  Branches               ?     8518           
==============================================
  Hits                   ?    37306           
  Misses                 ?    14079           
  Partials               ?     2701           
Flag Coverage Δ
GPU 68.97% <100.00%> (?)

@zhangbo9674 changed the title from "[Cherry-Pick] Fix moe topk select bug in cudagraph" to "[Cherry-Pick] Fix moe topk select bug in cudagraph(#7069)" on Mar 30, 2026
@zoooo0820 merged commit 5f5c932 into PaddlePaddle:release/2.5 on Mar 30, 2026
36 of 42 checks passed
4 participants