@yzhou103 commented Jan 28, 2026

Motivation

Technical Details

Change max wave num to 16 if n >= 16384 (see the sketch below)
Update output allocation in the test
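
Below is a minimal sketch of the wave-cap heuristic; both the helper name and the fallback value are assumptions, not aiter code, and the real dispatch logic lives in the kernel.

```python
# Hypothetical sketch of the wave-number cap described above.
def pick_max_wave_num(n: int) -> int:
    if n >= 16384:
        return 16  # cap the wave number for large n
    return 32      # assumed default for smaller problem sizes
```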

Test Plan

Test Result

Submission Checklist

yzhou103 and others added 30 commits January 26, 2026 07:41
… in Batch Prefill kernel (#1754)

* add page size 16 to test and op

* add num_total_pages to kernel parameter

* add is_sglang parameter

* change is_sglang to is_sglang_layout

* kv last page size=16 pass

* pass kv_last_page_lens to kernel

* add parameters check before calling kernel

* change kv layout to [page_num, page_size, nhead, hdim]

* adopt the changes of struct fmha_fwd_batch_prefill_traits

* change kv cache memory layout to [num_blocks, num_kv_heads, head_size/8, block_size, 8] for K and [num_blocks, num_kv_heads, block_size/8, head_size, 8] for V

* [FMHA] Integrate vLLM block table support and enforce vectorized KV layout

Updated `mha_batch_prefill` API and tests to support vLLM-style block tables alongside SGLang-style page tables, while enforcing the new hardware-optimized 5D vectorized KV cache layout.

**Key Changes:**
*   **API**: Added `block_table` and `seqlen_k` arguments to python/C++ interfaces.
*   **Layout Enforcement**: Added strict checks for the 5D vectorized KV layout (swizzled x=8) in host bindings and Python wrappers (see the layout sketch below).
*   **CodeGen**: Automatically select `VLLM_BLOCK_TABLE_2D` or `SGLANG_PAGE_TABLE_1D` trait based on input arguments.
*   **Tests**: Added `test_batch_prefill_vllm` to verify block table correctness and updated existing tests to use the vectorized layout.
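
As a rough illustration of the enforced layout, here is a minimal sketch (not the aiter implementation) of producing the 5D vectorized K/V layouts from a standard paged cache; the source shape [num_blocks, block_size, num_kv_heads, head_size] and divisibility by 8 are assumptions.

```python
import torch

def to_vectorized_k(k_cache: torch.Tensor) -> torch.Tensor:
    nb, bs, nh, hd = k_cache.shape  # assumed [num_blocks, block_size, num_kv_heads, head_size]
    # -> [num_blocks, num_kv_heads, head_size // 8, block_size, 8]
    return (
        k_cache.permute(0, 2, 3, 1)       # [nb, nh, hd, bs]
        .reshape(nb, nh, hd // 8, 8, bs)  # split head_size into (hd // 8, 8)
        .permute(0, 1, 2, 4, 3)
        .contiguous()
    )

def to_vectorized_v(v_cache: torch.Tensor) -> torch.Tensor:
    nb, bs, nh, hd = v_cache.shape
    # -> [num_blocks, num_kv_heads, block_size // 8, head_size, 8]
    return (
        v_cache.permute(0, 2, 1, 3)       # [nb, nh, bs, hd]
        .reshape(nb, nh, bs // 8, 8, hd)  # split block_size into (bs // 8, 8)
        .permute(0, 1, 2, 4, 3)
        .contiguous()
    )
```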

* update CK

* update ck

* adopt api changes from fmha_batch_prefill_traits

* add support for linear kv cache layout

* update api

* Refactor the test code by gathering the different test functions into one

* update ck

* update ck

* Add profile measurements for batch prefill function

* update ck

* fix style

* fix style

* [FMHA] Support 3D linear layout (page_size=1) and non-contiguous KV tensors in batch prefill

- Enable 3D [N, H, D] K/V tensors for batch prefill, treating them as a linear layout with page_size=1 (sketched below).
- Relax contiguity checks to only require the last dimension to be contiguous.
- Update C++ stride calculations for 3D, 4D, and 5D layouts.
- Add tests for 3D layout and non-contiguous KV cache.
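
A sketch of how a 3D tensor maps onto the paged view with page_size = 1 (names here are illustrative, not the aiter API):

```python
import torch

total_tokens, num_heads, head_dim = 1024, 8, 128
k_linear = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.float16)

# Paged view: [num_pages, page_size, num_heads, head_dim] with one token per page.
k_paged_view = k_linear.view(total_tokens, 1, num_heads, head_dim)

# For a linear layout, the page table is just the token indices themselves.
page_table = torch.arange(total_tokens, dtype=torch.int32)
```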

* update ck

---------

Co-authored-by: ltqin <[email protected]>
* Standardize pattern of GMM kernel config file

* Refactor `get_gemm_config` calls to pass hardcoded strings as config names

* Add Python script to select which Triton tests to run

* Write tests to run to environment file

* Remove timestamps from logging messages

GitHub already has a "show timestamps" feature, so logging timestamps
isn't adding anything.

* [CI] Run test selection script on CI job

* Add benchmarks and test selection script to Triton paths filter.
* Install NetworkX dependency of test selection script.
* Fetch target branch from remote.
* Run test selection script.

* Comment out writing to `GITHUB_ENV` file

The first stage of the test selection script is dry-run only. We'll evaluate
its correctness in the wild over time and, later on, fully enable it.
… Tensor Access in PA Decode Gluon (#1774)

* Refactor query loading to use MTP 3D layout in paged attention decode

- Replace query strides (seq, head) with (bs, qlen, kv_head, group_size)
- Add mtp_blocked_query_layout for [seq_len, group_size, head_size] tensor
- Load query directly with 3D layout and reshape, removing transpose_query_gluon dependency
- Add QUERY_SEQ_LEN=4 and QUERY_GROUP_SIZE_ONE_Q=8 constants

* rm qo trans part1

* rm qo trans part2

* Fix MTP layout index conversion for exp_sums, max_logits and temporary_output

- Convert MTP layout indices to continuous indices for exp_sums/max_logits access
- Convert MTP layout indices to continuous indices for temporary_output access
- Fix OUTPUT_GROUP_SIZE boundary check to use OUTPUT_SEQ_LEN_POW2 instead of OUTPUT_SEQ_LEN
- Add proper index conversion in reduce kernel for reading from temporary_output
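
In spirit, the "MTP layout index -> continuous index" conversion is a row-major flattening over (seq, group); the sketch below is an assumption about the mapping, not the kernel code.

```python
# Illustrative only: the actual mapping in pa_decode_gluon may differ.
def mtp_to_continuous(seq_idx: int, group_idx: int, group_size: int) -> int:
    return seq_idx * group_size + group_idx
```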

* Fix MTP layout mask and index conversion in paged attention decode gluon kernel

- Rename qk_row_mask_3d/1d to query_row_mask_3d/1d for clarity
- Add separate pv_row_mask for PV operations with proper layout conversion
- Fix max_logits_base_offsets to convert MTP layout indices to continuous indices
- Fix output_group_offsets to convert MTP layout indices to continuous indices
- Use qk_row_mask and pv_row_mask for consistency with paged_attention_decode_v2_gluon_dot_kernel
- Apply same fixes to paged_attention_decode_sliding_window kernel

* Simplify pa_decode_gluon interface and remove unused transpose kernels

- Add TORCH_TO_TL_DTYPE dictionary for torch to triton dtype conversion (shape illustrated below)
- Remove unused transpose_query_gluon_kernel and transpose_query_gluon functions
- Remove unused transpose_output_gluon_kernel and transpose_output_gluon functions
- Simplify pa_decode_gluon function signature by removing intermediate gluon tensors
  (output_gluon, query_gluon, query_scale_gluon parameters)
- Change compute_type parameter from tl.dtype to torch.dtype
- Remove commented-out transpose code blocks
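
The dtype map has roughly the following shape (these entries are assumed, not copied from the source):

```python
import torch
import triton.language as tl

# Illustrative torch -> triton dtype mapping in the spirit of TORCH_TO_TL_DTYPE.
TORCH_TO_TL_DTYPE = {
    torch.float16: tl.float16,
    torch.bfloat16: tl.bfloat16,
    torch.float32: tl.float32,
}
```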

* Refactor: rename QUERY_GROUP_SIZE_ORIGINAL to ONE_QUERY_GROUP_SIZE for clarity

- Rename QUERY_GROUP_SIZE_ORIGINAL to ONE_QUERY_GROUP_SIZE in kernel parameters
- Rename QUERY_GROUP_SIZE_ONE_Q_POW2 to ONE_QUERY_GROUP_SIZE_POW2 for consistency
- Rename OUTPUT_GROUP_SIZE to ONE_OUTPUT_GROUP_SIZE in reduce kernel
- Remove redundant query_group_size_original runtime parameter
- Clean up unused variables and optimize output storage in sliding window kernel
- Fix sink token loading offset calculation

* Refactor PA decode gluon AOT: remove transpose kernels and optimize implementation

- Remove transpose_query_gluon_kernel and transpose_output_gluon_kernel
- Remove transpose_query_output_gluon_aot.py and prebuild script
- Update pa_decode_gluon_aot implementation with optimizations
- Simplify attention and reduce kernel templates
- Update tests to reflect new implementation

* black format

* Refactor: clean up unused imports, variables, and dead code in pa_gluon_aot modules

- Remove unused imports (Optional, run_perftest, not_built, etc.)
- Remove unused variables and redundant tensor creation code
- Remove dead code branches in pa_decode_gluon_aot_prebuild.py
- Fix f-string warnings where no variables were interpolated
- Reorganize import statements for better clarity

* Refactor pa_decode_gluon test and prebuild scripts

- Simplify run_gluon_kernel interface by removing redundant parameters
- Add support for sinks and sliding_window options in test configurations
- Refactor compute_types_quant_q_and_kv handling for better flexibility
- Rename and reorganize test functions (simple_test -> normal_accuracy_test, etc.)
- Improve test result output with grouped statistics by compute_type
- Update default process count and tolerance values
- Add get_so_files_size_and_count utility function

* Refactor PA decode gluon tests: add pytest support and remove obsolete test file

- Add pytest support to test_pa_decode_gluon.py with parametrized test_multi_case_set
- Update TEST_NAME to 'main.normal_accuracy_performance.jit'
- Enable sliding_window accuracy and performance tests in main
- Add assertion failure message for better test feedback
- Apply code formatting improvements to pa_decode_gluon_aot_prebuild.py
- Remove obsolete test_transpose_query_output_gluon.py

* merge swa

* ps pa merge

* add missing comments

* fix workspace size

* fix test

* Update paged attention decode gluon test: add head dimension tests and enable sliding window tests

- Add test cases for different head dimensions (64, 192, 256)
- Disable USE_TORCH_FLASH_REF_OPTIONS in performance tests
- Enable sliding_window_accuracy_test and sliding_window_performance_test

* Remove torch_to_triton_dtype import to fix multiprocessing error

Remove 'from aiter.ops.triton.utils.types import torch_to_triton_dtype'
import as it causes errors in multiprocessing Pool.map() calls in
pa_decode_gluon_aot_prebuild.py.

Changes:
- Replace torch_to_triton_dtype with TORCH_TO_TL_DTYPE in pa_decode_gluon.py
- Remove unused torch_to_triton_dtype import and conversion in pa_decode_gluon_aot.py
- Rename max_context_length to max_context_partition_num for clarity

* fix bug

* Add Triton version compatibility for MFMA instruction shape configuration

- Add parse_triton_version() function to handle version string parsing (sketched below)
- Add TRITON_VERSION_GE_3_6_0 constexpr flag for version checking
- Update AMDMFMALayout instr_shape to use dynamic configuration based on Triton version (3.6.0+)
- Set MFMA_INSTR_K based on COMPUTE_TYPE and CDNA_VERSION
- Refactor constants and validation code organization
- Enable performance tests in test_pa_decode_gluon.py
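
A minimal sketch of the version gate; the helper name matches the commit message, but the body is an assumption:

```python
import re
import triton

def parse_triton_version(version: str) -> tuple:
    # Keep only the leading numeric components, e.g. "3.6.0+gitabc123" -> (3, 6, 0).
    return tuple(int(p) for p in re.findall(r"\d+", version)[:3])

TRITON_VERSION_GE_3_6_0 = parse_triton_version(triton.__version__) >= (3, 6, 0)
```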

* Fix Triton 3.6.0 compatibility in PA decode reduce kernel and improve test robustness

- Add version-specific tensor dimension ordering in paged_attention_decode_v2_reduce_kernel
  for Triton >= 3.6.0 compatibility
- Skip permute operation for Triton 3.6.0+ as it handles dimensions differently
- Switch test mode back to JIT from AOT
- Add empty DataFrame check before analyzing test results to prevent errors

* Refactor dtype mapping and improve AOT prebuild multiprocessing

- Refactor: Use shared torch_to_triton_dtype from utils.types instead of duplicated TORCH_TO_TL_DTYPE
- Fix: Use ProcessPoolExecutor with spawn context to avoid CUDA reinitialization issues in parallel test execution (pattern sketched below)
- Add: Extended AOT prebuild test configurations for head dimensions (64, 192, 256)
- Add: New test case set options (normal_accuracy_aot, sliding_window_performance)
- Format: Apply black code formatting
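
The spawn-context pattern looks roughly like this (`run_one_case` is a stand-in for the real per-test entry point):

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def run_one_case(cfg):
    ...  # real per-test work goes here

def run_all(cases):
    # Forked workers would inherit an already-initialized CUDA context,
    # so spawn fresh interpreters instead.
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=4, mp_context=ctx) as pool:
        return list(pool.map(run_one_case, cases))
```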

* Fix argument parsing conflict when running via pytest

- Add sys import for argv access
- Detect pytest execution environment and use empty args to avoid
  conflicts with pytest's command line arguments
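
A sketch of the workaround, with an illustrative flag name:

```python
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--mode", default="jit")  # hypothetical flag

# Under pytest, parse an empty argv so argparse does not choke on
# pytest's own command-line arguments.
if "pytest" in sys.modules:
    args = parser.parse_args([])
else:
    args = parser.parse_args()
```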

---------

Co-authored-by: fsx950223 <[email protected]>
* add mha

* add attn ck

* fix

* fix II

* fix III

* fix IV

* format

* format II

* fix5

* bug fix6

* fix 7

* bug 8

* some fix

* bug fix

* format

* format II

* update CK

* update CK

---------

Co-authored-by: zufayu <[email protected]>
Co-authored-by: amd-ruitang3 <[email protected]>
* rebase and init moe optimization

* add avg col in common

* fix

Aiter fails its import test with "ModuleNotFoundError: No module named 'packaging'".

The aiter package imports and uses the packaging module at runtime in
multiple files, but only declares it in setup_requires (build-time) instead of also declaring it in install_requires (runtime).
This causes "ModuleNotFoundError: No module named 'packaging'" when
importing aiter in environments where packaging is not already installed.

This PR fixes the issue by patching setup.py to also declare packaging as a runtime dependency (see the sketch below).
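
The change amounts to something like the following (argument values are illustrative, not the exact diff):

```python
from setuptools import setup

setup(
    name="aiter",
    setup_requires=["packaging"],    # build-time, as before
    install_requires=["packaging"],  # also declare it as a runtime dependency
)
```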

Signed-off-by: Anu Oguntayo <[email protected]>

GitHub Actions aborts the CI pipeline if a process exits with a non-zero code. This commit fixes a bug in the Triton test selection script so that, regardless of the test selection outcome, the CI pipeline is not aborted.

* enable gptoss_sink

Signed-off-by: Linjun-AMD <[email protected]>

* Update csrc/py_itfs_ck/mha_batch_prefill_kernels.cu

Co-authored-by: Copilot <[email protected]>

* Update mha_batch_prefill_kernels.cu

* update mha_bwd parameter

Signed-off-by: Linjun-AMD <[email protected]>

* Update mha.py

* Fix formatting for bias argument in rocm_ops.hpp

* fix some format error

Signed-off-by: Linjun-AMD <[email protected]>

* Update mha.py

* update args

Signed-off-by: Linjun-AMD <[email protected]>

* Update mha_fwd.cpp

* update ck commit

Signed-off-by: Linjun-AMD <[email protected]>

* use aiter main branch ck commit

Signed-off-by: Linjun-AMD <[email protected]>

* update ck commit

Signed-off-by: Linjun-AMD <[email protected]>

* Update mha_batch_prefill_kernels.cu

---------

Signed-off-by: Linjun-AMD <[email protected]>
Co-authored-by: Copilot <[email protected]>
* fix(paps): fix support for multi kheads

Signed-off-by: Double Young <[email protected]>

* fix(paps): fix reset work_indptr and use empty init in ut

Signed-off-by: Double Young <[email protected]>

---------

Signed-off-by: Double Young <[email protected]>
…#1762)

* [Docs] Add README for Triton Ops detailing general maintenance points
* initial commit

* fix

* test ck tile tuning

* temp save

* temp save

* refactor

* fix tile

* support ck tile abquant

* fix error

* fix error

* fix error

* fix error

* fix error

* test tuning

* fix tile compile error

* add more tile instance

* test tile instance tuning

* add more valid instances

* fix test bug

* fix default tile instance

* fix

* fix actions error

* format code style

* Apply Black 25.12.0 formatting to match CI

* fix CI

* fix CI

* rename legacy

* add profile result

* update ck

* code format

* fix mismatch ck kernel

* fix CI

* delete tune flag

* update ck

* merge aiter main branch
* Testing fake_tensor fix

* Same logic for var len attn

* Fix

---------

Co-authored-by: Lingpeng Jin <[email protected]>
…ernally (#1821)

* Implement a new api that will be switching between asm and hip pa

Inference engines should now call paged_attention_common with the
shuffled KV cache layout, and aiter will internally decide between the ASM
and HIP kernels; HIP is more performant at lower concurrencies (< 128).
A unit test has also been updated to cover the new interface.

Note that shuffled scales are not supported by the HIP path, so calls are
always redirected to ASM when the KV cache is in int8 or fp8 format
(dispatch rule sketched below).
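
A hedged sketch of that dispatch rule; the function name and exact conditions paraphrase this description rather than quote the code:

```python
def paged_attention_common_dispatch(batch_size: int, kv_dtype: str) -> str:
    # Shuffled scales are unsupported in HIP, so quantized KV caches
    # always go to the ASM kernel.
    if kv_dtype in ("int8", "fp8"):
        return "asm"
    # HIP is more performant at lower concurrencies.
    if batch_size < 128:
        return "hip"
    return "asm"
```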

* Delete op_tests/README_pa_merged_tests.md

* Delete op_tests/test_pa_merged.py

* Fix formatting according to Black requirements

* Fix one last place with broken formatting

* Remove modification to pa_v1; we already have pa for 5D kv cache

* Fix another formatting issue

* Add proper quant support for the common API

* Apply formatting

* Remove redundant parameters

* Remove redundant parameters

---------

Co-authored-by: Sergey Solo <[email protected]>
Co-authored-by: Mikko Tukiainen <[email protected]>
* add_tune_dsfp4_gemm

* update

* Update aiter/jit/core.py

Co-authored-by: Copilot <[email protected]>

---------

Co-authored-by: Copilot <[email protected]>
* Fused rope_kv and bmm

* Apply suggestion from @github-actions[bot]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Apply suggestion from @github-actions[bot]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update fused_bmm_rope_kv_cache.py

* Update fused_bmm_rope_kv_cache.py

* add test

* update

* update

* parse bmm config

* fp8 API and kernel change

* fp8 UT

* Apply suggestion from @github-actions[bot]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Apply suggestion from @github-actions[bot]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Formatting with black

* pytest skip if fp4/8 is not available on device

* code format with black

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ShaoChunLee <[email protected]>
ZhangLirong-amd and others added 24 commits January 28, 2026 03:29
* skip mla_prefill_ps when gfx942

* use chip info
* Causal conv1d triton sglang
* CI: Temporarily use previous vllm nightly image

* Update vllm_benchmark.yaml

* Update vllm_benchmark.yaml
…1868)

* Do not parse AITER source files that have hard-to-track dependencies

These source files can be safely ignored for Triton test selection
purposes.

* Do not tag `__init__.py` files as kernels

* Resolve Gluon config files by replacing GEMM M/N/K placeholders

* Refactor JSON string resolution

* Apply `markdownlint` suggestions to Triton ops `README.md`

* Document test selection script in Triton ops `README.md`

* Document `expand_mnk` function
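
An illustrative sketch of resolving GEMM M/N/K placeholders in a JSON config string, in the spirit of `expand_mnk`; the placeholder syntax here is an assumption:

```python
import json

def resolve_config(raw: str, m: int, n: int, k: int) -> dict:
    resolved = raw.replace("{M}", str(m)).replace("{N}", str(n)).replace("{K}", str(k))
    return json.loads(resolved)

# e.g. resolve_config('{"BLOCK_M": {M}, "BLOCK_N": {N}, "BLOCK_K": {K}}', 128, 128, 64)
```
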
* added gluon kernel for gemm_afp4wfp4

* updated the gemm_afp4wfp4 test and bench scripts to support running the kernel
* first version: add num_kv_head_dim

* opt scalar gpr

* fix k=128 error and refactor code

* fix test

* fix test

* add interface comment

* fix test format

* update

* update test
* add asm topksoftmax for bf16 input [gfx942]

* fix fused_moe_dp_share_expert

* update test

* add asm topksoftmax for bf16 input [gfx950]

* add 128 expert topk4

* fix

* fix2

* fix3

---------

Co-authored-by: amd-ruitang3 <[email protected]>
* fix: pa_decode_gluon import cause ci error

* docs: black format

* format with black

---------

Co-authored-by: onealliu <[email protected]>
Co-authored-by: zufayu <[email protected]>
* major refactor of mla reduce

* includes are wrongly ordered by clang-format
…pport 7.2 (#1906)

* CI: Temporarily switch Aiter tests back to ROCm 7.1 as CK does not support 7.2

* Update aiter-test.yaml

* Update .github/workflows/aiter-test.yaml

Co-authored-by: Copilot <[email protected]>

---------

Co-authored-by: Copilot <[email protected]>
#1907)

* add add_rmsnorm_quant_kernel v1

* add add_rmsnorm

* fix address offset int32 overflow && optimize kernel

* add block quant

* add rmsnorm/rmsnorm_quant api and rewrite ut

* fix fp4 quant and optimize kernel dispatch

* add some check

* optimize kernel: some instances use NT load

* optimize kernel 2

* update api and add kernel instance

* update

* format
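
For orientation, a reference-style sketch (not the aiter kernel) of the fused add + RMSNorm + quantize pattern these commits build up; the fp8 dtype and return convention are assumptions:

```python
import torch

def add_rmsnorm_quant_ref(x, residual, weight, eps=1e-6, scale=1.0):
    h = x + residual                                   # fused residual add
    rms = torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + eps)
    normed = h * rms * weight                          # RMSNorm
    fmax = torch.finfo(torch.float8_e4m3fnuz).max
    q = torch.clamp(normed / scale, -fmax, fmax).to(torch.float8_e4m3fnuz)
    return q, h                                        # quantized output + new residual
```
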
* add fp8 block scale quantization stride parameters

* add start ptr and block size parameters

* update ck develop version

* update block scale parameters

* format code

---------

Co-authored-by: Xin Huang <[email protected]>
* Added lookups for moe_routing
* Fix issue in the calculation of workgroups per batch and head.

* Move loading of reduce_partial_map into the simple and massive pipelines respectively.

* clang-format

Fix clang-format issue
@yzhou103 requested review from a team and Copilot January 28, 2026 03:35
@yzhou103 (Contributor, Author) commented:

[two screenshot images, not preserved]

Copilot AI left a comment:

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
