@ccsuzzh
Contributor

@ccsuzzh ccsuzzh commented Oct 13, 2025

  • Auto-generate the MultiQueryAppendC4Attention template instantiations into multiple .cu files
  • Auto-generate the MultiQueryAppendAttention template instantiations into multiple .cu files
  • Auto-generate the MultiQueryDecoderAttention template instantiations into multiple .cu files
  • Add -t to nvcc_compile_args

thread_num: 4
build_and_install_ops compile time: 03:53:29 -> 02:58:20
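
To illustrate the splitting described above, here is a minimal sketch of a generator that expands a parameter grid into explicit template instantiations and spreads them over several .cu files so they can be compiled in parallel. The parameter names, the `LaunchAttention` signature, the `attention_impl.cuh` header, and the output file names are all hypothetical; the real values come from template_config.json and autogen_template_instantiation.py, which are not reproduced in this thread.

```python
from itertools import product

# Hypothetical parameter grid; the real one lives in template_config.json.
PARAMS = {
    "T": ["half", "nv_bfloat16"],
    "HEAD_DIM": [64, 128],
    "CAUSAL": ["true", "false"],
}

# Hypothetical signature of the function template being instantiated.
LINE = "template void LaunchAttention<{T}, {HEAD_DIM}, {CAUSAL}>(const Params&);"


def render_instantiations(params, line_template):
    """Return one explicit-instantiation line per parameter combination."""
    names = list(params)
    return [
        line_template.format(**dict(zip(names, combo)))
        for combo in product(*(params[name] for name in names))
    ]


def split_into_files(lines, max_per_file, header='#include "attention_impl.cuh"'):
    """Group instantiation lines into file bodies of at most max_per_file lines."""
    files = {}
    for start in range(0, len(lines), max_per_file):
        name = f"attention_part_{start // max_per_file}.cu"  # hypothetical naming
        body = [header, ""] + lines[start:start + max_per_file]
        files[name] = "\n".join(body) + "\n"
    return files


lines = render_instantiations(PARAMS, LINE)
files = split_into_files(lines, max_per_file=3)
```

With this grid, the 8 combinations land in three files holding 3, 3, and 2 instantiations. Smaller files give the build more independent translation units to schedule, at the cost of re-parsing the shared header once per file.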

@ccsuzzh ccsuzzh changed the title 【Hackathon 9th No.86】autogen MultiQueryDecoderAttention template_instantiation -part 【Hackathon 9th No.86】autogen MultiQueryDecoderAttention template_instantiation -part Oct 13, 2025
@paddle-bot

paddle-bot bot commented Oct 13, 2025

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Oct 13, 2025
@ccsuzzh
Contributor Author

ccsuzzh commented Oct 14, 2025

/re-run all-failed

@YuanRisheng YuanRisheng requested a review from Copilot October 14, 2025 08:04
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements auto-generation of template instantiation files for multiple attention kernel types and adds parallel compilation support for NVCC. The primary goal is to improve build efficiency and maintainability by automating template instantiation generation that was previously done manually.

Key changes include:

  • Auto-generation of MultiQueryAppendC4Attention, MultiQueryAppendAttention, and MultiQueryDecoderAttention template instantiation files
  • Addition of -t flag to NVCC compile arguments for parallel compilation
  • Refactoring of existing template code into separate implementation files
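
As a rough sketch of the second bullet, the helper below appends nvcc's -t (--threads) option to an argument list. nvcc uses this option to run compilation steps in parallel when building for multiple GPU architectures. The function name and the `NVCC_THREADS` environment variable are assumptions made for illustration; the thread does not show how custom_ops/setup_ops.py actually obtains its thread count.

```python
import os


def with_nvcc_threads(nvcc_args, threads=None):
    """Return a copy of nvcc_args with a -t (--threads) option appended.

    If the caller already passed a thread option, the list is returned
    unchanged so the flag is never given twice.
    """
    if threads is None:
        # NVCC_THREADS is a hypothetical variable name used for this sketch.
        threads = int(os.getenv("NVCC_THREADS", "4"))
    if threads < 0:
        raise ValueError("threads must be non-negative")

    already_set = any(
        arg in ("-t", "--threads") or arg.startswith(("-t=", "--threads="))
        for arg in nvcc_args
    )
    if already_set:
        return list(nvcc_args)
    return list(nvcc_args) + ["-t", str(threads)]


nvcc_compile_args = with_nvcc_threads(["-O3", "-std=c++17"], threads=4)
```

Returning a new list rather than mutating the input keeps the helper safe to call on argument lists shared between several extension targets.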

Reviewed Changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| custom_ops/setup_ops.py | Adds parallel compilation support by including -t flag with worker thread count |
| custom_ops/gpu_ops/append_attn/template_config.json | Configuration file defining template generation parameters for attention kernels |
| custom_ops/gpu_ops/append_attn/autogen_template_instantiation.py | Universal template instantiator that generates files based on JSON configuration |
| Multiple template_instantiation/*.cu files | Removal of manually written template instantiation files (now auto-generated) |
| Multiple *_impl.cuh and *_kernel.h files | Refactored implementation files separating kernel definitions from instantiations |

const int num_chunks = div_up(max_seq_len, chunk_size);
dim3 grids(num_blocks_x_cpu, num_chunks, kv_num_heads);
dim3 blocks(32, num_warps);
if (num_chunks <= 0) {

Copilot AI Oct 14, 2025

The condition num_chunks <= 0 should be num_chunks <= 1 to match the logic used in other similar implementations. With <= 1, an input that fits in a single chunk takes the no-split kernel path instead of the split path.

Suggested change
- if (num_chunks <= 0) {
+ if (num_chunks <= 1) {
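
To make the effect of the two thresholds concrete, here is a small model of the dispatch. `div_up` mirrors the ceiling division in the quoted snippet. The `choose_path` helper and its "no_split"/"split" labels are hypothetical stand-ins for the two kernel paths, and the thread does not establish whether the original `<= 0` check is intentional for this kernel.

```python
def div_up(a, b):
    """Ceiling division for positive integers, as in the quoted snippet."""
    return (a + b - 1) // b


def choose_path(max_seq_len, chunk_size, threshold):
    """Pick a kernel path from the chunk count.

    threshold=0 models `num_chunks <= 0`; threshold=1 models the suggested
    `num_chunks <= 1`. The returned labels are illustrative only.
    """
    num_chunks = div_up(max_seq_len, chunk_size)
    return "no_split" if num_chunks <= threshold else "split"


# A sequence shorter than one chunk yields exactly one chunk.
single_chunk_old = choose_path(100, 128, threshold=0)
single_chunk_new = choose_path(100, 128, threshold=1)
multi_chunk_new = choose_path(300, 128, threshold=1)
```

Under these assumptions, a 100-token sequence with a 128-token chunk size takes the split path with the original check and the no-split path with the suggested one, while a 300-token sequence (3 chunks) takes the split path either way.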


dim3 grids(num_blocks_x_cpu, num_chunks, kv_num_heads);
dim3 blocks(32, num_warps);
if (num_chunks <= 0) {

Copilot AI Oct 14, 2025

The condition num_chunks <= 0 should be num_chunks <= 1 to match the logic used in other similar implementations. With <= 1, an input that fits in a single chunk takes the no-split kernel path instead of the split path.

Suggested change
- if (num_chunks <= 0) {
+ if (num_chunks <= 1) {

@YuanRisheng
Collaborator

Please also describe the performance improvement in the PR description.

@ccsuzzh
Contributor Author

ccsuzzh commented Oct 14, 2025

/re-run all-failed

@YuanRisheng YuanRisheng reopened this Oct 14, 2025
@YuanRisheng
Collaborator

It looks like CI is having problems today because of the Paddle version; you can push a new commit to trigger a re-run.

@ccsuzzh
Contributor Author

ccsuzzh commented Oct 14, 2025

It looks like CI is having problems today because of the Paddle version; you can push a new commit to trigger a re-run.

OK.

@ccsuzzh
Contributor Author

ccsuzzh commented Oct 14, 2025

Please also describe the performance improvement in the PR description.

Done

@ccsuzzh
Contributor Author

ccsuzzh commented Oct 15, 2025

/re-run all-failed

@ccsuzzh ccsuzzh requested a review from YuanRisheng October 16, 2025 00:34
@YuanRisheng YuanRisheng merged commit 6adfbe0 into PaddlePaddle:develop Oct 16, 2025
15 of 17 checks passed
@ccsuzzh ccsuzzh deleted the decode_attention_kernel branch October 16, 2025 09:22