@yzhou103 commented Jan 28, 2026

Motivation

Technical Details

Change max wave num to 16 if n >= 16384 (see the sketch below)
Update output allocation in the test
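
Below is a minimal sketch of the wave-cap heuristic; both the helper name and the fallback value are assumptions, not aiter code, and the real dispatch logic lives in the kernel.

```python
# Hypothetical sketch of the wave-number cap described above.
def pick_max_wave_num(n: int) -> int:
    if n >= 16384:
        return 16  # cap the wave number for large n
    return 32      # assumed default for smaller problem sizes
```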

Test Plan

Test Result

Submission Checklist

yzhou103 and others added 30 commits January 26, 2026 07:41
… in Batch Prefill kernel (#1754)

* add page size 16 to test and op

* add num_total_pages to kernel parameter

* add is_sglang parameter

* change is_sglang to is_sglang_layout

* kv last page size=16 pass

* pass kv_last_page_lens to kernel

* add parameters check before calling kernel

* change kv layout to [page_num, page_size, nhead, hdim]

* adopt the changes of struct fmha_fwd_batch_prefill_traits

* change kv cache memory layout to [num_blocks, num_kv_heads, head_size/8, block_size, 8] for K and [num_blocks, num_kv_heads, block_size/8, head_size, 8] for V

* [FMHA] Integrate vLLM block table support and enforce vectorized KV layout

Updated `mha_batch_prefill` API and tests to support vLLM-style block tables alongside SGLang-style page tables, while enforcing the new hardware-optimized 5D vectorized KV cache layout.

**Key Changes:**
*   **API**: Added `block_table` and `seqlen_k` arguments to python/C++ interfaces.
*   **Layout Enforcement**: Added strict checks for the 5D vectorized KV layout (swizzled x=8) in host bindings and Python wrappers (see the layout sketch below).
*   **CodeGen**: Automatically select `VLLM_BLOCK_TABLE_2D` or `SGLANG_PAGE_TABLE_1D` trait based on input arguments.
*   **Tests**: Added `test_batch_prefill_vllm` to verify block table correctness and updated existing tests to use the vectorized layout.
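
As a rough illustration of the enforced layout, here is a minimal sketch (not the aiter implementation) of producing the 5D vectorized K/V layouts from a standard paged cache; the source shape [num_blocks, block_size, num_kv_heads, head_size] and divisibility by 8 are assumptions.

```python
import torch

def to_vectorized_k(k_cache: torch.Tensor) -> torch.Tensor:
    nb, bs, nh, hd = k_cache.shape  # assumed [num_blocks, block_size, num_kv_heads, head_size]
    # -> [num_blocks, num_kv_heads, head_size // 8, block_size, 8]
    return (
        k_cache.permute(0, 2, 3, 1)       # [nb, nh, hd, bs]
        .reshape(nb, nh, hd // 8, 8, bs)  # split head_size into (hd // 8, 8)
        .permute(0, 1, 2, 4, 3)
        .contiguous()
    )

def to_vectorized_v(v_cache: torch.Tensor) -> torch.Tensor:
    nb, bs, nh, hd = v_cache.shape
    # -> [num_blocks, num_kv_heads, block_size // 8, head_size, 8]
    return (
        v_cache.permute(0, 2, 1, 3)       # [nb, nh, bs, hd]
        .reshape(nb, nh, bs // 8, 8, hd)  # split block_size into (bs // 8, 8)
        .permute(0, 1, 2, 4, 3)
        .contiguous()
    )
```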

* update CK

* update ck

* adopt api changes from fmha_batch_prefill_traits

* add support for linear kv cache layout

* update api

* Refactor the test code by gathering the different test functions into one

* update ck

* update ck

* Add profile measurements for batch prefill function

* update ck

* fix style

* fix style

* [FMHA] Support 3D linear layout (page_size=1) and non-contiguous KV tensors in batch prefill

- Enable 3D [N, H, D] K/V tensors for batch prefill, treating them as a linear layout with page_size=1 (sketched below).
- Relax contiguity checks to only require the last dimension to be contiguous.
- Update C++ stride calculations for 3D, 4D, and 5D layouts.
- Add tests for 3D layout and non-contiguous KV cache.
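
A sketch of how a 3D tensor maps onto the paged view with page_size = 1 (names here are illustrative, not the aiter API):

```python
import torch

total_tokens, num_heads, head_dim = 1024, 8, 128
k_linear = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.float16)

# Paged view: [num_pages, page_size, num_heads, head_dim] with one token per page.
k_paged_view = k_linear.view(total_tokens, 1, num_heads, head_dim)

# For a linear layout, the page table is just the token indices themselves.
page_table = torch.arange(total_tokens, dtype=torch.int32)
```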

* update ck

---------

Co-authored-by: ltqin <[email protected]>
* Standardize pattern of GMM kernel config file

* Refactor `get_gemm_config` calls to pass hardcoded strings as config names

* Add Python script to select which Triton tests to run

* Write tests to run to environment file

* Remove timestamps from logging messages

GitHub already has a "show timestamps" feature, so logging timestamps
isn't adding anything.

* [CI] Run test selection script on CI job

* Add benchmarks and test selection script to Triton paths filter.
* Install NetworkX dependency of test selection script.
* Fetch target branch from remote.
* Run test selection script.

* Comment out writing to `GITHUB_ENV` file

The first stage of the test selection script is dry-run only. We'll evaluate
its correctness in the wild over time and, later on, fully enable it.
… Tensor Access in PA Decode Gluon (#1774)

* Refactor query loading to use MTP 3D layout in paged attention decode

- Replace query strides (seq, head) with (bs, qlen, kv_head, group_size)
- Add mtp_blocked_query_layout for [seq_len, group_size, head_size] tensor
- Load query directly with 3D layout and reshape, removing transpose_query_gluon dependency
- Add QUERY_SEQ_LEN=4 and QUERY_GROUP_SIZE_ONE_Q=8 constants

* rm qo trans part1

* rm qo trans part2

* Fix MTP layout index conversion for exp_sums, max_logits and temporary_output

- Convert MTP layout indices to continuous indices for exp_sums/max_logits access
- Convert MTP layout indices to continuous indices for temporary_output access
- Fix OUTPUT_GROUP_SIZE boundary check to use OUTPUT_SEQ_LEN_POW2 instead of OUTPUT_SEQ_LEN
- Add proper index conversion in reduce kernel for reading from temporary_output
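
In spirit, the "MTP layout index -> continuous index" conversion is a row-major flattening over (seq, group); the sketch below is an assumption about the mapping, not the kernel code.

```python
# Illustrative only: the actual mapping in pa_decode_gluon may differ.
def mtp_to_continuous(seq_idx: int, group_idx: int, group_size: int) -> int:
    return seq_idx * group_size + group_idx
```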

* Fix MTP layout mask and index conversion in paged attention decode gluon kernel

- Rename qk_row_mask_3d/1d to query_row_mask_3d/1d for clarity
- Add separate pv_row_mask for PV operations with proper layout conversion
- Fix max_logits_base_offsets to convert MTP layout indices to continuous indices
- Fix output_group_offsets to convert MTP layout indices to continuous indices
- Use qk_row_mask and pv_row_mask for consistency with paged_attention_decode_v2_gluon_dot_kernel
- Apply same fixes to paged_attention_decode_sliding_window kernel

* Simplify pa_decode_gluon interface and remove unused transpose kernels

- Add TORCH_TO_TL_DTYPE dictionary for torch to triton dtype conversion (shape illustrated below)
- Remove unused transpose_query_gluon_kernel and transpose_query_gluon functions
- Remove unused transpose_output_gluon_kernel and transpose_output_gluon functions
- Simplify pa_decode_gluon function signature by removing intermediate gluon tensors
  (output_gluon, query_gluon, query_scale_gluon parameters)
- Change compute_type parameter from tl.dtype to torch.dtype
- Remove commented-out transpose code blocks
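
The dtype map has roughly the following shape (these entries are assumed, not copied from the source):

```python
import torch
import triton.language as tl

# Illustrative torch -> triton dtype mapping in the spirit of TORCH_TO_TL_DTYPE.
TORCH_TO_TL_DTYPE = {
    torch.float16: tl.float16,
    torch.bfloat16: tl.bfloat16,
    torch.float32: tl.float32,
}
```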

* Refactor: rename QUERY_GROUP_SIZE_ORIGINAL to ONE_QUERY_GROUP_SIZE for clarity

- Rename QUERY_GROUP_SIZE_ORIGINAL to ONE_QUERY_GROUP_SIZE in kernel parameters
- Rename QUERY_GROUP_SIZE_ONE_Q_POW2 to ONE_QUERY_GROUP_SIZE_POW2 for consistency
- Rename OUTPUT_GROUP_SIZE to ONE_OUTPUT_GROUP_SIZE in reduce kernel
- Remove redundant query_group_size_original runtime parameter
- Clean up unused variables and optimize output storage in sliding window kernel
- Fix sink token loading offset calculation

* Refactor PA decode gluon AOT: remove transpose kernels and optimize implementation

- Remove transpose_query_gluon_kernel and transpose_output_gluon_kernel
- Remove transpose_query_output_gluon_aot.py and prebuild script
- Update pa_decode_gluon_aot implementation with optimizations
- Simplify attention and reduce kernel templates
- Update tests to reflect new implementation

* black format

* Refactor: clean up unused imports, variables, and dead code in pa_gluon_aot modules

- Remove unused imports (Optional, run_perftest, not_built, etc.)
- Remove unused variables and redundant tensor creation code
- Remove dead code branches in pa_decode_gluon_aot_prebuild.py
- Fix f-string warnings where no variables were interpolated
- Reorganize import statements for better clarity

* Refactor pa_decode_gluon test and prebuild scripts

- Simplify run_gluon_kernel interface by removing redundant parameters
- Add support for sinks and sliding_window options in test configurations
- Refactor compute_types_quant_q_and_kv handling for better flexibility
- Rename and reorganize test functions (simple_test -> normal_accuracy_test, etc.)
- Improve test result output with grouped statistics by compute_type
- Update default process count and tolerance values
- Add get_so_files_size_and_count utility function

* Refactor PA decode gluon tests: add pytest support and remove obsolete test file

- Add pytest support to test_pa_decode_gluon.py with parametrized test_multi_case_set
- Update TEST_NAME to 'main.normal_accuracy_performance.jit'
- Enable sliding_window accuracy and performance tests in main
- Add assertion failure message for better test feedback
- Apply code formatting improvements to pa_decode_gluon_aot_prebuild.py
- Remove obsolete test_transpose_query_output_gluon.py

* merge swa

* ps pa merge

* add missing comments

* fix workspace size

* fix test

* Update paged attention decode gluon test: add head dimension tests and enable sliding window tests

- Add test cases for different head dimensions (64, 192, 256)
- Disable USE_TORCH_FLASH_REF_OPTIONS in performance tests
- Enable sliding_window_accuracy_test and sliding_window_performance_test

* Remove torch_to_triton_dtype import to fix multiprocessing error

Remove 'from aiter.ops.triton.utils.types import torch_to_triton_dtype'
import as it causes errors in multiprocessing Pool.map() calls in
pa_decode_gluon_aot_prebuild.py.

Changes:
- Replace torch_to_triton_dtype with TORCH_TO_TL_DTYPE in pa_decode_gluon.py
- Remove unused torch_to_triton_dtype import and conversion in pa_decode_gluon_aot.py
- Rename max_context_length to max_context_partition_num for clarity

* fix bug

* Add Triton version compatibility for MFMA instruction shape configuration

- Add parse_triton_version() function to handle version string parsing (sketched below)
- Add TRITON_VERSION_GE_3_6_0 constexpr flag for version checking
- Update AMDMFMALayout instr_shape to use dynamic configuration based on Triton version (3.6.0+)
- Set MFMA_INSTR_K based on COMPUTE_TYPE and CDNA_VERSION
- Refactor constants and validation code organization
- Enable performance tests in test_pa_decode_gluon.py
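
A minimal sketch of the version gate; the helper name matches the commit message, but the body is an assumption:

```python
import re
import triton

def parse_triton_version(version: str) -> tuple:
    # Keep only the leading numeric components, e.g. "3.6.0+gitabc123" -> (3, 6, 0).
    return tuple(int(p) for p in re.findall(r"\d+", version)[:3])

TRITON_VERSION_GE_3_6_0 = parse_triton_version(triton.__version__) >= (3, 6, 0)
```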

* Fix Triton 3.6.0 compatibility in PA decode reduce kernel and improve test robustness

- Add version-specific tensor dimension ordering in paged_attention_decode_v2_reduce_kernel
  for Triton >= 3.6.0 compatibility
- Skip permute operation for Triton 3.6.0+ as it handles dimensions differently
- Switch test mode back to JIT from AOT
- Add empty DataFrame check before analyzing test results to prevent errors

* Refactor dtype mapping and improve AOT prebuild multiprocessing

- Refactor: Use shared torch_to_triton_dtype from utils.types instead of duplicated TORCH_TO_TL_DTYPE
- Fix: Use ProcessPoolExecutor with spawn context to avoid CUDA reinitialization issues in parallel test execution (pattern sketched below)
- Add: Extended AOT prebuild test configurations for head dimensions (64, 192, 256)
- Add: New test case set options (normal_accuracy_aot, sliding_window_performance)
- Format: Apply black code formatting
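
The spawn-context pattern looks roughly like this (`run_one_case` is a stand-in for the real per-test entry point):

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def run_one_case(cfg):
    ...  # real per-test work goes here

def run_all(cases):
    # Forked workers would inherit an already-initialized CUDA context,
    # so spawn fresh interpreters instead.
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=4, mp_context=ctx) as pool:
        return list(pool.map(run_one_case, cases))
```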

* Fix argument parsing conflict when running via pytest

- Add sys import for argv access
- Detect pytest execution environment and use empty args to avoid
  conflicts with pytest's command line arguments
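
A sketch of the workaround, with an illustrative flag name:

```python
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--mode", default="jit")  # hypothetical flag

# Under pytest, parse an empty argv so argparse does not choke on
# pytest's own command-line arguments.
if "pytest" in sys.modules:
    args = parser.parse_args([])
else:
    args = parser.parse_args()
```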

---------

Co-authored-by: fsx950223 <[email protected]>
* add mha

* add attn ck

* fix

* fix II

* fix III

* fix IV

* format

* format II

* fix5

* bug fix6

* fix 7

* bug 8

* some fix

* bug fix

* format

* format II

* update CK

* update CK

---------

Co-authored-by: zufayu <[email protected]>
Co-authored-by: amd-ruitang3 <[email protected]>
* rebase and init moe optimization

* add avg col in common

* fix

Aiter fails its import test with "ModuleNotFoundError: No module named 'packaging'".

The aiter package imports and uses the packaging module at runtime in
multiple files, but only declares it in setup_requires (build-time) instead of also declaring it in install_requires (runtime).
This causes "ModuleNotFoundError: No module named 'packaging'" when
importing aiter in environments where packaging is not already installed.

This PR fixes the issue by patching setup.py to also declare packaging as a runtime dependency (see the sketch below).
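
The change amounts to something like the following (argument values are illustrative, not the exact diff):

```python
from setuptools import setup

setup(
    name="aiter",
    setup_requires=["packaging"],    # build-time, as before
    install_requires=["packaging"],  # also declare it as a runtime dependency
)
```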

Signed-off-by: Anu Oguntayo <[email protected]>

GitHub Actions aborts the CI pipeline if a process exits with a non-zero code. This commit fixes a bug in the Triton test selection script so that, regardless of the test selection outcome, the CI pipeline is not aborted.

* enable gptoss_sink

Signed-off-by: Linjun-AMD <[email protected]>

* Update csrc/py_itfs_ck/mha_batch_prefill_kernels.cu

Co-authored-by: Copilot <[email protected]>

* Update mha_batch_prefill_kernels.cu

* update mha_bwd parameter

Signed-off-by: Linjun-AMD <[email protected]>

* Update mha.py

* Fix formatting for bias argument in rocm_ops.hpp

* fix some format error

Signed-off-by: Linjun-AMD <[email protected]>

* Update mha.py

* update args

Signed-off-by: Linjun-AMD <[email protected]>

* Update mha_fwd.cpp

* update ck commit

Signed-off-by: Linjun-AMD <[email protected]>

* use aiter main branch ck commit

Signed-off-by: Linjun-AMD <[email protected]>

* update ck commit

Signed-off-by: Linjun-AMD <[email protected]>

* Update mha_batch_prefill_kernels.cu

---------

Signed-off-by: Linjun-AMD <[email protected]>
Co-authored-by: Copilot <[email protected]>
* fix(paps): fix support for multi kheads

Signed-off-by: Double Young <[email protected]>

* fix(paps): fix reset work_indptr and use empty init in ut

Signed-off-by: Double Young <[email protected]>

---------

Signed-off-by: Double Young <[email protected]>
…#1762)

* [Docs] Add README for Triton Ops detailing general maintenance points
* initial commit

* fix

* test ck tile tuning

* temp save

* temp save

* refactor

* fix tile

* support ck tile abquant

* fix error

* fix error

* fix error

* fix error

* fix error

* test tuning

* fix tile compile error

* add more tile instance

* test tile instance tuning

* add more valid instances

* fix test bug

* fix default tile instance

* fix

* fix actions error

* format code style

* Apply Black 25.12.0 formatting to match CI

* fix CI

* fix CI

* rename legacy

* add profile result

* update ck

* code format

* fix mismatch ck kernel

* fix CI

* delete tune flag

* update ck

* merge aiter main branch
* Testing fake_tensor fix

* Same logic for var len attn

* Fix

---------

Co-authored-by: Lingpeng Jin <[email protected]>
…ernally (#1821)

* Implement a new api that will be switching between asm and hip pa

Inference engines should now call paged_attention_common with the
shuffled KV cache layout, and aiter will internally decide between the ASM
and HIP kernels; HIP is more performant at lower concurrencies (< 128).
A unit test has also been updated to cover the new interface.

Note that shuffled scales are not supported by the HIP path, so calls are
always redirected to ASM when the KV cache is in int8 or fp8 format
(dispatch rule sketched below).
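
A hedged sketch of that dispatch rule; the function name and exact conditions paraphrase this description rather than quote the code:

```python
def paged_attention_common_dispatch(batch_size: int, kv_dtype: str) -> str:
    # Shuffled scales are unsupported in HIP, so quantized KV caches
    # always go to the ASM kernel.
    if kv_dtype in ("int8", "fp8"):
        return "asm"
    # HIP is more performant at lower concurrencies.
    if batch_size < 128:
        return "hip"
    return "asm"
```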

* Delete op_tests/README_pa_merged_tests.md

* Delete op_tests/test_pa_merged.py

* Fix formatting according to Black requirements

* Fix one last place with broken formatting

* Remove modification to pa_v1; we already have pa for 5D kv cache

* Fix another formatting issue

* Add proper quant support for the common API

* Apply formatting

* Remove redundant parameters

* Remove redundant parameters

---------

Co-authored-by: Sergey Solo <[email protected]>
Co-authored-by: Mikko Tukiainen <[email protected]>
* add_tune_dsfp4_gemm

* update

* Update aiter/jit/core.py

Co-authored-by: Copilot <[email protected]>

---------

Co-authored-by: Copilot <[email protected]>
* Fused rope_kv and bmm

* Apply suggestion from @github-actions[bot]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Apply suggestion from @github-actions[bot]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update fused_bmm_rope_kv_cache.py

* Update fused_bmm_rope_kv_cache.py

* add test

* update

* update

* parse bmm config

* fp8 API and kernel change

* fp8 UT

* Apply suggestion from @github-actions[bot]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Apply suggestion from @github-actions[bot]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Formatting with black

* pytest skip if fp4/8 is not available on device

* code format with black

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ShaoChunLee <[email protected]>
ZhangLirong-amd and others added 24 commits January 28, 2026 03:29
* skip mla_prefill_ps when gfx942

* use chip info
* Causal conv1d triton sglang
* CI: Temporarily use previous vllm nightly image

* Update vllm_benchmark.yaml

* Update vllm_benchmark.yaml
…1868)

* Do not parse AITER source files that have hard-to-track dependencies

These source files can be safely ignored for Triton test selection
purposes.

* Do not tag `__init__.py` files as kernels

* Resolve Gluon config files by replacing GEMM M/N/K placeholders

* Refactor JSON string resolution

* Apply `markdownlint` suggestions to Triton ops `README.md`

* Document test selection script in Triton ops `README.md`

* Document `expand_mnk` function
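
An illustrative sketch of resolving GEMM M/N/K placeholders in a JSON config string, in the spirit of `expand_mnk`; the placeholder syntax here is an assumption:

```python
import json

def resolve_config(raw: str, m: int, n: int, k: int) -> dict:
    resolved = raw.replace("{M}", str(m)).replace("{N}", str(n)).replace("{K}", str(k))
    return json.loads(resolved)

# e.g. resolve_config('{"BLOCK_M": {M}, "BLOCK_N": {N}, "BLOCK_K": {K}}', 128, 128, 64)
```
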
* added gluon kernel for gemm_afp4wfp4

* updated the gemm_afp4wfp4 test and bench scripts to support running the kernel
* first version: add num_kv_head_dim

* opt scalar gpr

* fix k=128 error and refactor code

* fix test

* fix test

* add interface comment

* fix test format

* update

* update test
* add asm topksoftmax for bf16 input [gfx942]

* fix fused_moe_dp_share_expert

* update test

* add asm topksoftmax for bf16 input [gfx950]

* add 128 expert topk4

* fix

* fix2

* fix3

---------

Co-authored-by: amd-ruitang3 <[email protected]>
* fix: pa_decode_gluon import cause ci error

* docs: black format

* format with black

---------

Co-authored-by: onealliu <[email protected]>
Co-authored-by: zufayu <[email protected]>
* major refactor of mla reduce

* includes are wrongly ordered by clang-format
…pport 7.2 (#1906)

* CI: Temporarily switch Aiter tests back to ROCm 7.1 as CK does not support 7.2

* Update aiter-test.yaml

* Update .github/workflows/aiter-test.yaml

Co-authored-by: Copilot <[email protected]>

---------

Co-authored-by: Copilot <[email protected]>
#1907)

* add add_rmsnorm_quant_kernel v1

* add add_rmsnorm

* fix address offset int32 overflow && optimize kernel

* add block quant

* add rmsnorm/rmsnorm_quant api and rewrite ut

* fix fp4 quant and optimize kernel dispatch

* add some check

* optimize kernel: some instances use NT load

* optimize kernel 2

* update api and add kernel instance

* update

* format
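
For orientation, a reference-style sketch (not the aiter kernel) of the fused add + RMSNorm + quantize pattern these commits build up; the fp8 dtype and return convention are assumptions:

```python
import torch

def add_rmsnorm_quant_ref(x, residual, weight, eps=1e-6, scale=1.0):
    h = x + residual                                   # fused residual add
    rms = torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + eps)
    normed = h * rms * weight                          # RMSNorm
    fmax = torch.finfo(torch.float8_e4m3fnuz).max
    q = torch.clamp(normed / scale, -fmax, fmax).to(torch.float8_e4m3fnuz)
    return q, h                                        # quantized output + new residual
```
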
* add fp8 block scale quantization stride parameters

* add start ptr and block size parameters

* update ck develop version

* update block scale parameters

* format code

---------

Co-authored-by: Xin Huang <[email protected]>
* Added lookups for moe_routing
* Fix issue in the calculation of workgroups per batch and head.

* Move loading of reduce_partial_map into the simple and massive pipelines respectively.

* clang-format

Fix clang-format issue
@yzhou103 requested review from a team and Copilot January 28, 2026 03:35
@yzhou103 (Contributor, Author) commented:

[two screenshot images, not preserved]

Copilot AI left a comment:

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
