Opt scaled_act_and_mul #1918
Open
yzhou103 wants to merge 66 commits into main from opt_activation
Conversation
… in Batch Prefill kernel (#1754)
* add page size 16 to test and op
* add num_total_pages to kernel parameter
* add is_sglang parameter
* change is_sglang to is_sglang_layout
* kv last page size=16 pass
* pass kv_last_page_lens to kernel
* add parameters check before calling kernel
* change kv layout to [page_num, page_size, nhead, hdim]
* adopt the changes of struct fmha_fwd_batch_prefill_traits
* change kv cache memory layout to [num_blocks, num_kv_heads, head_size/8, block_size, 8], [num_blocks, num_kv_heads, block_size/8, head_size, 8] (see the layout sketch after this commit)
* [FMHA] Integrate vLLM block table support and enforce vectorized KV layout
  Updated `mha_batch_prefill` API and tests to support vLLM-style block tables alongside SGLang-style page tables, while enforcing the new hardware-optimized 5D vectorized KV cache layout.
  **Key Changes:**
  * **API**: Added `block_table` and `seqlen_k` arguments to python/C++ interfaces.
  * **Layout Enforcement**: Added strict checks for 5D vectorized KV layout (swizzled x=8) in host bindings and python wrappers.
  * **CodeGen**: Automatically select `VLLM_BLOCK_TABLE_2D` or `SGLANG_PAGE_TABLE_1D` trait based on input arguments.
  * **Tests**: Added `test_batch_prefill_vllm` to verify block table correctness and updated existing tests to use the vectorized layout.
* update CK
* update ck
* adopt api changes from fmha_batch_prefill_traits
* add support for linear kv cache layout
* update api
* Refactor the test code by gathering the different test functions into one
* update ck
* update ck
* Add profile measurements for batch prefill function
* update ck
* fix style
* fix style
* [FMHA] Support 3D linear layout (page_size=1) and non-contiguous KV tensors in batch prefill
  - Enable 3D [N, H, D] K/V tensors for batch prefill, treating as linear layout with page_size=1.
  - Relax contiguity checks to only require the last dimension to be contiguous.
  - Update C++ stride calculations for 3D, 4D, and 5D layouts.
  - Add tests for 3D layout and non-contiguous KV cache.
* update ck
---------
Co-authored-by: ltqin <[email protected]>
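For orientation, the 5D vectorized KV cache layouts named in the commit above can be pictured as plain tensor allocations. This is a minimal sketch assuming PyTorch-style tensors; the dimension names come from the commit message, while the concrete sizes and dtype are placeholders.

```python
import torch

# Illustrative sizes only (assumptions); x = 8 is the vector/swizzle width
# mentioned in the commit ("swizzled x=8").
num_blocks, num_kv_heads, head_size, block_size, x = 64, 8, 128, 16, 8

# K cache layout: [num_blocks, num_kv_heads, head_size/8, block_size, 8]
k_cache = torch.empty(
    num_blocks, num_kv_heads, head_size // x, block_size, x, dtype=torch.float16
)
# V cache layout: [num_blocks, num_kv_heads, block_size/8, head_size, 8]
v_cache = torch.empty(
    num_blocks, num_kv_heads, block_size // x, head_size, x, dtype=torch.float16
)

# The later "3D linear layout" commit treats a plain [N, H, D] tensor as a
# paged cache with page_size = 1.
k_linear = torch.empty(1024, num_kv_heads, head_size, dtype=torch.float16)
```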
* Standardize pattern of GMM kernel config file
* Refactor `get_gemm_config` calls to pass hardcoded strings as config names
* Add Python script to select which Triton tests to run
* Write tests to run to environment file
* Remove timestamps from logging messages
  GitHub already has a "show timestamps" feature, so logging timestamps isn't adding anything.
* [CI] Run test selection script on CI job
  * Add benchmarks and test selection script to Triton paths filter.
  * Install NetworkX dependency of test selection script.
  * Fetch target branch from remote.
  * Run test selection script.
* Comment out writing to `GITHUB_ENV` file
  The first stage of the test selection script is dry-run only. We'll evaluate its correctness in the wild over time and, later on, fully enable it.
… Tensor Access in PA Decode Gluon (#1774)
* Refactor query loading to use MTP 3D layout in paged attention decode
  - Replace query strides (seq, head) with (bs, qlen, kv_head, group_size)
  - Add mtp_blocked_query_layout for [seq_len, group_size, head_size] tensor
  - Load query directly with 3D layout and reshape, removing transpose_query_gluon dependency
  - Add QUERY_SEQ_LEN=4 and QUERY_GROUP_SIZE_ONE_Q=8 constants
* rm qo trans part1
* rm qo trans part2
* Fix MTP layout index conversion for exp_sums, max_logits and temporary_output
  - Convert MTP layout indices to continuous indices for exp_sums/max_logits access
  - Convert MTP layout indices to continuous indices for temporary_output access
  - Fix OUTPUT_GROUP_SIZE boundary check to use OUTPUT_SEQ_LEN_POW2 instead of OUTPUT_SEQ_LEN
  - Add proper index conversion in reduce kernel for reading from temporary_output
* Fix MTP layout mask and index conversion in paged attention decode gluon kernel
  - Rename qk_row_mask_3d/1d to query_row_mask_3d/1d for clarity
  - Add separate pv_row_mask for PV operations with proper layout conversion
  - Fix max_logits_base_offsets to convert MTP layout indices to continuous indices
  - Fix output_group_offsets to convert MTP layout indices to continuous indices
  - Use qk_row_mask and pv_row_mask for consistency with paged_attention_decode_v2_gluon_dot_kernel
  - Apply same fixes to paged_attention_decode_sliding_window kernel
* Simplify pa_decode_gluon interface and remove unused transpose kernels
  - Add TORCH_TO_TL_DTYPE dictionary for torch to triton dtype conversion
  - Remove unused transpose_query_gluon_kernel and transpose_query_gluon functions
  - Remove unused transpose_output_gluon_kernel and transpose_output_gluon functions
  - Simplify pa_decode_gluon function signature by removing intermediate gluon tensors (output_gluon, query_gluon, query_scale_gluon parameters)
  - Change compute_type parameter from tl.dtype to torch.dtype
  - Remove commented-out transpose code blocks
* Refactor: rename QUERY_GROUP_SIZE_ORIGINAL to ONE_QUERY_GROUP_SIZE for clarity
  - Rename QUERY_GROUP_SIZE_ORIGINAL to ONE_QUERY_GROUP_SIZE in kernel parameters
  - Rename QUERY_GROUP_SIZE_ONE_Q_POW2 to ONE_QUERY_GROUP_SIZE_POW2 for consistency
  - Rename OUTPUT_GROUP_SIZE to ONE_OUTPUT_GROUP_SIZE in reduce kernel
  - Remove redundant query_group_size_original runtime parameter
  - Clean up unused variables and optimize output storage in sliding window kernel
  - Fix sink token loading offset calculation
* Refactor PA decode gluon AOT: remove transpose kernels and optimize implementation
  - Remove transpose_query_gluon_kernel and transpose_output_gluon_kernel
  - Remove transpose_query_output_gluon_aot.py and prebuild script
  - Update pa_decode_gluon_aot implementation with optimizations
  - Simplify attention and reduce kernel templates
  - Update tests to reflect new implementation
* black format
* Refactor: clean up unused imports, variables, and dead code in pa_gluon_aot modules
  - Remove unused imports (Optional, run_perftest, not_built, etc.)
  - Remove unused variables and redundant tensor creation code
  - Remove dead code branches in pa_decode_gluon_aot_prebuild.py
  - Fix f-string warnings where no variables were interpolated
  - Reorganize import statements for better clarity
* Refactor pa_decode_gluon test and prebuild scripts
  - Simplify run_gluon_kernel interface by removing redundant parameters
  - Add support for sinks and sliding_window options in test configurations
  - Refactor compute_types_quant_q_and_kv handling for better flexibility
  - Rename and reorganize test functions (simple_test -> normal_accuracy_test, etc.)
  - Improve test result output with grouped statistics by compute_type
  - Update default process count and tolerance values
  - Add get_so_files_size_and_count utility function
* Refactor PA decode gluon tests: add pytest support and remove obsolete test file
  - Add pytest support to test_pa_decode_gluon.py with parametrized test_multi_case_set
  - Update TEST_NAME to 'main.normal_accuracy_performance.jit'
  - Enable sliding_window accuracy and performance tests in main
  - Add assertion failure message for better test feedback
  - Apply code formatting improvements to pa_decode_gluon_aot_prebuild.py
  - Remove obsolete test_transpose_query_output_gluon.py
* merge swa
* ps pa merge
* add missing comments
* fix workspace size
* fix test
* Update paged attention decode gluon test: add head dimension tests and enable sliding window tests
  - Add test cases for different head dimensions (64, 192, 256)
  - Disable USE_TORCH_FLASH_REF_OPTIONS in performance tests
  - Enable sliding_window_accuracy_test and sliding_window_performance_test
* Remove torch_to_triton_dtype import to fix multiprocessing error
  Remove 'from aiter.ops.triton.utils.types import torch_to_triton_dtype' import as it causes errors in multiprocessing Pool.map() calls in pa_decode_gluon_aot_prebuild.py.
  Changes:
  - Replace torch_to_triton_dtype with TORCH_TO_TL_DTYPE in pa_decode_gluon.py
  - Remove unused torch_to_triton_dtype import and conversion in pa_decode_gluon_aot.py
  - Rename max_context_length to max_context_partition_num for clarity
* fix bug
* Add Triton version compatibility for MFMA instruction shape configuration
  - Add parse_triton_version() function to handle version string parsing
  - Add TRITON_VERSION_GE_3_6_0 constexpr flag for version checking
  - Update AMDMFMALayout instr_shape to use dynamic configuration based on Triton version (3.6.0+)
  - Set MFMA_INSTR_K based on COMPUTE_TYPE and CDNA_VERSION
  - Refactor constants and validation code organization
  - Enable performance tests in test_pa_decode_gluon.py
* Fix Triton 3.6.0 compatibility in PA decode reduce kernel and improve test robustness
  - Add version-specific tensor dimension ordering in paged_attention_decode_v2_reduce_kernel for Triton >= 3.6.0 compatibility
  - Skip permute operation for Triton 3.6.0+ as it handles dimensions differently
  - Switch test mode back to JIT from AOT
  - Add empty DataFrame check before analyzing test results to prevent errors
* Refactor dtype mapping and improve AOT prebuild multiprocessing
  - Refactor: Use shared torch_to_triton_dtype from utils.types instead of duplicated TORCH_TO_TL_DTYPE
  - Fix: Use ProcessPoolExecutor with spawn context to avoid CUDA reinitialization issues in parallel test execution
  - Add: Extended AOT prebuild test configurations for head dimensions (64, 192, 256)
  - Add: New test case set options (normal_accuracy_aot, sliding_window_performance)
  - Format: Apply black code formatting
* Fix argument parsing conflict when running via pytest
  - Add sys import for argv access
  - Detect pytest execution environment and use empty args to avoid conflicts with pytest's command line arguments
---------
Co-authored-by: fsx950223 <[email protected]>
* add mha
* add attn ck
* fix
* fix II
* fix III
* fix IV
* format
* format II
* fix5
* bug fix6
* fix 7
* bug 8
* some fix
* bug fix
* format
* format II
* update CK
* update CK
---------
Co-authored-by: zufayu <[email protected]>
Co-authored-by: amd-ruitang3 <[email protected]>
* rebase and init moe optimization * add avg col in common * fix
Aiter fails its import test with the error ModuleNotFoundError: No module named 'packaging'. The aiter package imports and uses the 'packaging' module at runtime in multiple files, but only declares it in setup_requires (build-time) instead of also declaring it in install_requires (runtime). This causes "ModuleNotFoundError: No module named 'packaging'" when importing aiter in environments where 'packaging' is not already installed. This PR fixes the issue by patching setup.py to declare 'packaging' as a runtime dependency (see the sketch below). Signed-off-by: Anu Oguntayo <[email protected]>
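As a rough sketch of that fix, assuming a standard setuptools-based setup.py (the real aiter setup.py has many more arguments that are not shown here):

```python
from setuptools import setup

setup(
    name="aiter",
    # ... other arguments unchanged ...
    setup_requires=["packaging"],    # build-time only; not visible at import time
    install_requires=["packaging"],  # runtime dependency, fixes the ModuleNotFoundError
)
```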
GitHub Actions aborts the CI pipeline if a process exits with a non-zero code. This commit fixes a bug in the Triton test selection script: no matter the test selection outcome, the CI pipeline shouldn't be aborted (see the sketch below).
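Illustrative only: one way a dry-run selection script can avoid aborting the pipeline is to catch its own failures and always return exit code zero. The function names below are placeholders, not the actual script.

```python
import sys


def select_tests() -> list:
    """Placeholder for the real selection logic."""
    return []


def main() -> int:
    try:
        selected = select_tests()
        print(f"Dry run: would select {len(selected)} Triton tests")
    except Exception as exc:  # selection problems must not fail the CI job
        print(f"Test selection failed, falling back to running everything: {exc}")
    return 0  # always exit zero so GitHub Actions does not abort the pipeline


if __name__ == "__main__":
    sys.exit(main())
```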
* enable gptoss_sink
  Signed-off-by: Linjun-AMD <[email protected]>
* Update csrc/py_itfs_ck/mha_batch_prefill_kernels.cu
  Co-authored-by: Copilot <[email protected]>
* Update mha_batch_prefill_kernels.cu
* update mha_bwd parameter
  Signed-off-by: Linjun-AMD <[email protected]>
* Update mha.py
* Fix formatting for bias argument in rocm_ops.hpp
* fix some format error
  Signed-off-by: Linjun-AMD <[email protected]>
* Update mha.py
* update args
  Signed-off-by: Linjun-AMD <[email protected]>
* Update mha_fwd.cpp
* update ck commit
  Signed-off-by: Linjun-AMD <[email protected]>
* use aiter main branch ck commit
  Signed-off-by: Linjun-AMD <[email protected]>
* update ck commit
  Signed-off-by: Linjun-AMD <[email protected]>
* Update mha_batch_prefill_kernels.cu
---------
Signed-off-by: Linjun-AMD <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: solin <[email protected]>
Co-authored-by: Xin Huang <[email protected]>
* fix(paps): fix support for multi kheads
  Signed-off-by: Double Young <[email protected]>
* fix(paps): fix reset work_indptr and use empty init in ut
  Signed-off-by: Double Young <[email protected]>
---------
Signed-off-by: Double Young <[email protected]>
…#1762) * [Docs] Add README for Triton Ops detailing general maintenance points
* initial commit
* fix
* test ck tile tuning
* temp save
* temp save
* refactor
* fix tile
* support ck tile abquant
* fix error
* fix error
* fix error
* fix error
* fix error
* test tuning
* fix tile compile error
* add more tile instance
* test tile instance tuning
* add more valid instances
* fix test bug
* fix default tile instance
* fix
* fix actions error
* format code style
* Apply Black 25.12.0 formatting to match CI
* fix CI
* fix CI
* rename legacy
* add profile result
* update ck
* code format
* fix mismatch ck kernel
* fix CI
* delete tune flag
* update ck
* merge aiter main branch
* Testing fake_tensor fix * Same logic for var len attn * Fix --------- Co-authored-by: Lingpeng Jin <[email protected]>
…ernally (#1821)
* Implement a new API that switches between the asm and hip PA kernels
  Inference engines should now call paged_attention_common with the shuffled KV cache layout, and aiter will internally decide between the asm and hip kernels. HIP is more performant at lower concurrencies (< 128). A unit test has also been updated to cover the new interface. Note that shuffled scales are not supported in HIP, so int8 and fp8 KV caches are always redirected to asm for now (see the dispatch sketch below).
* Delete op_tests/README_pa_merged_tests.md
* Delete op_tests/test_pa_merged.py
* Fix formatting according to Black requirements
* Fix one last place with broken formatting
* Remove modification to pa_v1, we already have pa for 5D kv cache
* Fix another formatting issue
* Add proper quant support for the common API
* Apply formatting
* Remove redundant parameters
* Remove redundant parameters
---------
Co-authored-by: Sergey Solo <[email protected]>
Co-authored-by: Mikko Tukiainen <[email protected]>
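A hedged sketch of the dispatch rule described above (HIP for low concurrency, ASM otherwise, and ASM whenever the KV cache is quantized). The function name and threshold constant are assumptions for illustration, not the actual aiter API.

```python
HIP_CONCURRENCY_LIMIT = 128  # HIP path reported faster below this concurrency


def choose_pa_backend(batch_size: int, kv_cache_dtype: str) -> str:
    """Illustrative backend selection only; not the real paged_attention_common."""
    if kv_cache_dtype in ("int8", "fp8"):
        return "asm"  # shuffled scales are only handled by the ASM kernel
    return "hip" if batch_size < HIP_CONCURRENCY_LIMIT else "asm"


# Toy usage
assert choose_pa_backend(64, "bf16") == "hip"
assert choose_pa_backend(256, "bf16") == "asm"
assert choose_pa_backend(64, "fp8") == "asm"
```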
* add_tune_dsfp4_gemm * update * Update aiter/jit/core.py Co-authored-by: Copilot <[email protected]> --------- Co-authored-by: Copilot <[email protected]>
* Fused rope_kv and bmm
* Apply suggestion from @github-actions[bot]
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Apply suggestion from @github-actions[bot]
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update fused_bmm_rope_kv_cache.py
* Update fused_bmm_rope_kv_cache.py
* add test
* update
* update
* parse bmm config
* fp8 API and kernel change
* fp8 UT
* Apply suggestion from @github-actions[bot]
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Apply suggestion from @github-actions[bot]
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Formatting with black
* pytest skip if fp4/8 is not avail on device
* code format with black
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ShaoChunLee <[email protected]>
* skip mla_prefill_ps when gfx942
* use chip info
* Causal conv1d triton sglang
…rage speedup on MI300X (#1879)
* CI: Temporarily use previous vllm nightly image * Update vllm_benchmark.yaml * Update vllm_benchmark.yaml
…1868)
* Do not parse AITER source files that have hard to track dependencies
  These source files can be safely ignored for Triton test selection purposes (an illustrative sketch of the selection idea follows this commit).
* Do not tag `__init__.py` files as kernels
* Resolve Gluon config files by replacing GEMM M/N/K placeholders
* Refactor JSON string resolution
* Apply `markdownlint` suggestions to Triton ops `README.md`
* Document test selection script in Triton ops `README.md`
* Document `expand_mnk` function
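To make the selection idea concrete, here is a heavily simplified sketch of dependency-based test selection using NetworkX (the dependency installed by the CI commit earlier). It is an assumption about the general approach, not the actual script: the real `expand_mnk`, config resolution, and source parsing are not shown, and all file names below are made up.

```python
import networkx as nx


def select_tests(changed_files, import_edges, all_tests):
    """Pick every test that directly or transitively depends on a changed file."""
    graph = nx.DiGraph(import_edges)  # edge: (file, file_it_imports)
    changed = set(changed_files)
    selected = []
    for test in all_tests:
        deps = {test}
        if test in graph:
            deps |= nx.descendants(graph, test)  # transitive dependencies
        if deps & changed:
            selected.append(test)
    return selected


# Toy usage with made-up file names: only the GEMM test depends on the changed file.
edges = [
    ("test_gemm.py", "gemm_kernel.py"),
    ("test_rmsnorm.py", "rmsnorm_kernel.py"),
]
print(select_tests({"gemm_kernel.py"}, edges, ["test_gemm.py", "test_rmsnorm.py"]))
# -> ['test_gemm.py']
```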
* added gluon kernel for gemm_afp4wfp4 * updated the gemm_afp4wfp4 test and bench scripts to support running the kernel
* first version: add num_kv_head_dim
* opt scalar gpr
* fix k=128 error and refactor code
* fix test
* fix test
* add interface comment
* fix test format
* update
* update test
* add asm topksoftmax for bf16 input [gfx942]
* fix fused_moe_dp_share_expert
* update test
* add asm topksoftmax for bf16 input [gfx950]
* add 128 expert topk4
* fix
* fix2
* fix3
---------
Co-authored-by: amd-ruitang3 <[email protected]>
* fix: pa_decode_gluon import cause ci error * docs: black format * format with black --------- Co-authored-by: onealliu <[email protected]> Co-authored-by: zufayu <[email protected]>
* great refactor mla reduce * includes are wrongly ordered by clang-format
…pport 7.2 (#1906)
* CI: Temporarily switch Aiter tests back to ROCm 7.1 as CK does not support 7.2
* Update aiter-test.yaml
* Update .github/workflows/aiter-test.yaml
  Co-authored-by: Copilot <[email protected]>
---------
Co-authored-by: Copilot <[email protected]>
#1907)
* add add_rmsnorm_quant_kernel v1
* add add_rmsnorm
* fix address offset int32 overflow && optimize kernel
* add block quant
* add rmsnorm/rmsnorm_quant api and rewrite ut
* fix fp4 quant and optimize kernel dispatch
* add some check
* optimize kernel: some instances use NT load
* optimize kernel 2
* update api and add kernel instance
* update
* format
* add fp8 block scale quantization stride parameters
* add start ptr and block size parameters
* update ck develop version
* update block scale parameters
* format code
---------
Co-authored-by: Xin Huang <[email protected]>
* Added lookups for moe_routing
* Fix issue on calculation of wg per batch and head.
* Move loading reduce_partial_map to simple and massive pipeline respectively.
* clang-format
  Fix clang-format issue
Signed-off-by: Double Young <[email protected]>
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Motivation
Technical Details
Change the max wave num to 16 if n >= 16384 (see the sketch below).
Update the output allocation in the test.
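A minimal sketch of the wave-count heuristic described above, assuming a Python-side launch helper. The function name and the default of 8 waves are placeholders; only the n >= 16384 -> 16 rule comes from this PR.

```python
def max_wave_num(n: int, default_waves: int = 8) -> int:
    """Placeholder helper: use 16 waves for large n, otherwise the default."""
    return 16 if n >= 16384 else default_waves


assert max_wave_num(16384) == 16
assert max_wave_num(8192) == 8
```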
Test Plan
Test Result
Submission Checklist