【Hackathon 9th No.86】autogen MultiQueryDecoderAttention template_instantiation -part
#4383
Conversation
MultiQueryDecoderAttention template_instantiation -part

Thanks for your contribution!

/re-run all-failed
custom_ops/gpu_ops/append_attn/autogen_template_instantiation.py
Pull Request Overview
This PR implements auto-generation of template instantiation files for multiple attention kernel types and adds parallel compilation support for NVCC. The primary goal is to improve build efficiency and maintainability by automating template instantiation generation that was previously done manually.
Key changes include:
- Auto-generation of `MultiQueryAppendC4Attention`, `MultiQueryAppendAttention`, and `MultiQueryDecoderAttention` template instantiation files
- Addition of the `-t` flag to NVCC compile arguments for parallel compilation
- Refactoring of existing template code into separate implementation files
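The auto-generation approach described above can be sketched as a small Python script that expands a JSON configuration into one explicit-instantiation `.cu` file per parameter combination. This is a minimal sketch only: the config keys, the instantiation signature, and the output layout are assumptions for illustration, not the actual schema of the PR's `template_config.json`.

```python
import itertools
from pathlib import Path

# Hypothetical config -- the PR's template_config.json keys likely differ.
config = {
    "kernel": "MultiQueryDecoderAttention",
    "header": "multi_query_decoder_attention_impl.cuh",
    "params": {"T": ["half", "nv_bfloat16"], "HEAD_DIM": ["64", "128"]},
}

# Hypothetical instantiation signature for illustration only.
template = (
    '#include "{header}"\n\n'
    "template void {kernel}<{args}>(const Params& params, cudaStream_t stream);\n"
)

out_dir = Path("template_instantiation")
out_dir.mkdir(exist_ok=True)

# One .cu file per combination of template arguments, so each file
# compiles independently and the build parallelizes across them.
names = list(config["params"])
for combo in itertools.product(*(config["params"][n] for n in names)):
    args = ", ".join(combo)
    fname = out_dir / "{}_{}.cu".format(config["kernel"].lower(), "_".join(combo))
    fname.write_text(
        template.format(header=config["header"], kernel=config["kernel"], args=args)
    )
```

Splitting instantiations into many small translation units is what makes the per-file NVCC work small enough for parallel compilation to pay off.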
Reviewed Changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| custom_ops/setup_ops.py | Adds parallel compilation support by including -t flag with worker thread count |
| custom_ops/gpu_ops/append_attn/template_config.json | Configuration file defining template generation parameters for attention kernels |
| custom_ops/gpu_ops/append_attn/autogen_template_instantiation.py | Universal template instantiator that generates files based on JSON configuration |
| Multiple template_instantiation/*.cu files | Removal of manually written template instantiation files (now auto-generated) |
| Multiple *_impl.cuh and *_kernel.h files | Refactored implementation files separating kernel definitions from instantiations |
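The `-t` (`--threads`) option mentioned in the table is a real NVCC flag that compiles device code for multiple architectures in parallel. A hedged sketch of how a setup script might append it, assuming a hypothetical helper name (the actual wiring in `custom_ops/setup_ops.py` may differ):

```python
import os

def add_parallel_compile_flag(nvcc_compile_args, thread_num=None):
    """Append NVCC's -t/--threads flag for parallel device compilation.

    Hypothetical helper for illustration; the PR hard-codes thread_num: 4.
    """
    if thread_num is None:
        # Fall back to the host CPU count, capped at 4 like the PR's setting.
        thread_num = min(os.cpu_count() or 1, 4)
    return nvcc_compile_args + ["-t", str(thread_num)]

args = add_parallel_compile_flag(["-O3", "--use_fast_math"], thread_num=4)
```

Note that `-t` parallelizes within a single NVCC invocation (across target architectures); it complements, rather than replaces, parallelism across the many generated `.cu` files.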
```cpp
const int num_chunks = div_up(max_seq_len, chunk_size);
dim3 grids(num_blocks_x_cpu, num_chunks, kv_num_heads);
dim3 blocks(32, num_warps);
if (num_chunks <= 0) {
```
Copilot AI · Oct 14, 2025
The condition num_chunks <= 0 should be num_chunks <= 1 to match the logic used in other similar implementations. This prevents execution when there's only one chunk, which should use the no-split kernel path.
```diff
-if (num_chunks <= 0) {
+if (num_chunks <= 1) {
```
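The boundary the reviewer points at is easy to see from the chunk arithmetic: with `num_chunks = div_up(max_seq_len, chunk_size)`, a sequence that fits in one chunk yields `num_chunks == 1`, so there is nothing to split and the no-split kernel path should be taken. A quick Python sketch of the ceiling-division helper (values are illustrative, not from the PR):

```python
def div_up(a, b):
    # Ceiling division, mirroring the CUDA-side div_up helper.
    return (a + b - 1) // b

chunk_size = 1024  # illustrative chunk size
for max_seq_len in (512, 1024, 1025, 4096):
    num_chunks = div_up(max_seq_len, chunk_size)
    # Reviewer's suggested guard: split-KV only when there is more
    # than one chunk; `num_chunks <= 0` would never skip the split path.
    use_split_kv = num_chunks > 1
    print(max_seq_len, num_chunks, use_split_kv)
```

Since `div_up` returns at least 1 for any positive `max_seq_len`, a `num_chunks <= 0` guard can never fire, which is why the `<= 1` form matches the other implementations.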
```cpp
dim3 grids(num_blocks_x_cpu, num_chunks, kv_num_heads);
dim3 blocks(32, num_warps);
if (num_chunks <= 0) {
```
Copilot AI · Oct 14, 2025
The condition num_chunks <= 0 should be num_chunks <= 1 to match the logic used in other similar implementations. This prevents execution when there's only one chunk, which should use the no-split kernel path.
```diff
-if (num_chunks <= 0) {
+if (num_chunks <= 1) {
```
Please also describe the performance improvement in the PR description.
/re-run all-failed
It seems CI is having some issues today due to the Paddle version; you can resubmit the code to trigger a re-run.

OK.

Done
/re-run all-failed
- Split `MultiQueryAppendC4Attention` template_instantiation into multiple .cu files
- Split `MultiQueryAppendAttention` template_instantiation into multiple .cu files
- Split `MultiQueryDecoderAttention` template_instantiation into multiple .cu files
- Add `-t` to nvcc_compile_args (thread_num: 4)
build_and_install_ops compile time: 03:53:29 -> 02:58:20
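The reported timings correspond to roughly a 24% reduction in build time; a quick check of the arithmetic (the timestamps are from the PR description, the helper name is illustrative):

```python
def to_seconds(hms):
    # Parse an "HH:MM:SS" duration string into total seconds.
    h, m, s = map(int, hms.split(":"))
    return h * 3600 + m * 60 + s

before = to_seconds("03:53:29")  # 14009 s
after = to_seconds("02:58:20")   # 10700 s
reduction = 1 - after / before
print(f"{reduction:.1%}")        # prints "23.6%"
```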