[XPU][OP] Add build_sampling_params kernel for MTP speculative decoding #8032

Open
Clarity256 wants to merge 3 commits into PaddlePaddle:develop from Clarity256:feature/xpu-build-sampling-params-kernel

Clarity256 commented Jun 10, 2026

Motivation

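While enabling CUDAGraph for XPU MTP speculative decoding, the existing padding_sampling_params (a Python-side CPU implementation) triggers host-device synchronization and therefore cannot be captured by CUDAGraph. This PR adds a build_sampling_params XPU custom operator that constructs the sampling parameters (top_p, top_k, topp_seed) and updates infer_seed in place entirely on the device, clearing the way for later CUDAGraph capture.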

Modifications

  • custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc: adds the Paddle custom operator entry point and registers the build_sampling_params op.
  • custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h: declares the build_sampling_params C interface.
  • custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu: adds the XPU3 kernel implementation.
  • custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp: adds the CPU wrapper and the XPU3 wrapper.
  • custom_ops/xpu_ops/test/test_build_sampling_params.py: adds unit tests covering pure-decoder, pure-encoder, mixed, single-request, and seed wrap-around scenarios.

Usage or Command

cd custom_ops/xpu_ops/test && python test_build_sampling_params.py

Accuracy Tests

The unit tests compare the op against a Python reference implementation (the original padding_sampling_params logic); all cases pass.

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Add a new XPU custom operator `build_sampling_params` that constructs
sampling parameters (top_p, top_k, topp_seed) on device for MTP
speculative decoding verification. This replaces the previous Python-level
`padding_sampling_params` approach with a more efficient XPU kernel
implementation that supports CudaGraph capture.

Key components:
- XPU kernel implementation (build_sampling_params.xpu)
- C++ wrapper and op registration
- Plugin header declaration
- Unit tests with comprehensive coverage

codecov-commenter commented Jun 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@4474188).

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8032   +/-   ##
==========================================
  Coverage           ?   67.84%           
==========================================
  Files              ?      470           
  Lines              ?    66111           
  Branches           ?    10187           
==========================================
  Hits               ?    44855           
  Misses             ?    18390           
  Partials           ?     2866           
Flag Coverage Δ
GPU 77.97% <ø> (?)
XPU 7.01% <ø> (?)


@PaddlePaddle-bot


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-12 03:40:38 UTC+08:00

This CI report was generated from the following code (refreshed every 30 minutes):
PR commit: 1abc13b | Merge base: 4474188 (branch: develop)

1 Required jobs: 8/10 passed

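| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 41 (0) | 41 | 37 | 4 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| Approval | Approval required | High | Job |
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | Flaky issue | Medium | Job |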

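2 Failure details

🔴 Approval: approval required (confidence: high)

This job requires manual approval; CI continues only after the approval is granted.

  • Root cause summary: manual approval is required
  • Suggested fix: obtain manual approval
  • Related changes: not applicable

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: flaky issue (confidence: medium)

Error type: flaky issue | Confidence: medium
Analyzer: ci_analyze_unittest_fastdeploy
Failed cases: merged by root-cause cluster

| Case | Error summary |
|---|---|
| tests/e2e/test_pd_reorder.py::test_model_against_baseline[ernie-4_5-21b-a3b-bf16-paddle.None.default] | The MTP speculative path timed out after 300s; the rank0 worker exited with code -6, and the log shows a libuv uv__finish_close assertion |

Key logs: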

tests/e2e/test_pd_reorder.py:112 -> RuntimeError: Worker process hung and was terminated
Container rank 0 status failed ... code -6 log /workspace/FastDeploy/unittest_logs/e2e/test_pd_reorder/log/paddle/workerlog.0
python: /paddle/third_party/libuv/src/unix/core.c:314: uv__finish_close: Assertion `handle->flags & UV_HANDLE_CLOSING` failed.
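  • Root cause summary: the CUDA MTP e2e worker exited abnormally on a libuv assertion
    The logs show that the failure occurred in the second MTP speculative call in tests/e2e/test_pd_reorder.py; the parent process waited 300s and then reported "Worker process hung and was terminated". On the worker side, the rank0 process exited with code -6, and a Paddle third_party libuv assertion appears in workerlog.0 / fastdeploy.log.
    The PR changes are concentrated in custom_ops/xpu_ops/.../build_sampling_params and the new XPU unit test. The failed job runs the CUDA/SM e2e path, where the CUDA branch of sampler.py imports fastdeploy.model_executor.ops.gpu.build_sampling_params and does not use the XPU op added by this PR. No direct call relationship with this PR was found.

Suggested fix:

  1. Rerun the job first. If the failure reproduces consistently, the MTP/CUDA runtime owner should investigate the process lifecycle in tests/e2e/test_pd_reorder.py and tests/model_loader/utils.py, as well as the Paddle libuv handle close path.

Related changes: custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc, custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu, custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp, custom_ops/xpu_ops/test/test_build_sampling_params.py. None of these XPU changes were found on the failing CUDA/SM path.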

Align the per-position seed offset stride with the Python
padding_sampling_params implementation it replaces: XPU requires a
stride of 32 (not 4) so that the generated topp_seed sequence matches
the original reference. Update both the kernel and CPU wrapper, and the
unit test reference accordingly.
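
The stride change described above can be illustrated with a small host-side reference. This is a minimal sketch, not the PR's kernel or the actual padding_sampling_params code: the function name `build_sampling_params_ref`, the `seq_lens_this_time` and `seq_lens_encoder` inputs, the rule that an encoder request contributes a single position with no offset, and the `max_seed` default are all assumptions made for illustration. The in-place `infer_seed` update is left out because the conversation does not state its increment.

```python
def build_sampling_params_ref(top_p, top_k, infer_seed, seq_lens_this_time,
                              seq_lens_encoder, stride=32, max_seed=2**63 - 2):
    """Hypothetical per-token expansion of per-request sampling parameters.

    Assumptions (not confirmed by the PR text):
      * a request with seq_lens_encoder[i] == 0 is a decoder request and
        contributes seq_lens_this_time[i] positions;
      * any other request contributes exactly one position with offset 0;
      * the seed at position `pos` is (infer_seed[i] + pos * stride) % max_seed.
    """
    top_p_out, top_k_out, seed_out = [], [], []
    for i in range(len(seq_lens_this_time)):
        is_decoder = seq_lens_encoder[i] == 0
        num_positions = int(seq_lens_this_time[i]) if is_decoder else 1
        for pos in range(num_positions):
            offset = pos * stride if is_decoder else 0
            top_p_out.append(top_p[i])
            top_k_out.append(top_k[i])
            # Python ints do not overflow, so the modulo alone handles wrap-around.
            seed_out.append((int(infer_seed[i]) + offset) % max_seed)
    return top_p_out, top_k_out, seed_out


# One decoder request with 3 draft positions followed by one encoder request.
p, k, s = build_sampling_params_ref(
    top_p=[0.8, 0.9], top_k=[10, 20], infer_seed=[5, 100],
    seq_lens_this_time=[3, 7], seq_lens_encoder=[0, 7],
)
print(p, k, s)
```

Passing `stride=4` instead of the default makes the difference between the two stride values visible in the decoder request's seed sequence.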

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-12 18:35:56

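📋 Review summary

PR overview: adds the XPU build_sampling_params custom op, which uses a device-side kernel to generate the sampling parameters and update infer_seed.
Change scope: op registration, plugin wrapper/kernel, and XPU unit tests under custom_ops/xpu_ops.
Impact tags: [XPU] [OP] [Speculative Decoding]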

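Issues

No new blocking issues were found. PR convention issues are reported in the section below and are not repeated here.

Status of previous findings

| Finding | Issue | Status |
|---|---|---|
| F1 | The new op is not yet wired into the actual XPU sampler path. | ⚠️ Still open |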

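📝 PR convention check

The title contains two tags, [XPU][OP]; the FastDeploy convention requires the title to contain exactly one official tag. The description structure follows the template.

Suggested title (can be copied directly):

  • [XPU] Add build_sampling_params kernel for MTP speculative decoding

Overall assessment

This round focused on the new op registration, XPU plugin build discovery, the seed semantics of the kernel/wrapper relative to padding_sampling_params, and the XPU sampler call path. No blocking issues were found in the new files themselves; the earlier integration finding is still unresolved, and the title tags still need to be reduced to one per the convention.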
