Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
44 commits
Select commit Hold shift + click to select a range
c35c8ac
add initial implementation of projection mapping
anhminhnguyenhoang Jan 16, 2026
2a8325c
Refactor mHC kernel and wrapper to include sigmoid activation in proj…
waqahmed-amd-fi Jan 19, 2026
f11244d
Add Sinkhorn-Knopp log-domain kernel implementation
anhminhnguyenhoang Jan 20, 2026
87e5839
clean up sinkhorn-knopp tests
anhminhnguyenhoang Jan 20, 2026
152bb21
review invalid test case
anhminhnguyenhoang Jan 20, 2026
18a62a1
Refactor mHC kernel and wrapper to implement equations 14-18 as fused…
waqahmed-amd-fi Jan 20, 2026
36e36d6
Fix H dims
waqahmed-amd-fi Jan 20, 2026
6156c13
fix test_mhc_output_range
waqahmed-amd-fi Jan 20, 2026
80b8d34
Refactor test cases in mHC and Sinkhorn-Knopp implementations and sim…
anhminhnguyenhoang Jan 20, 2026
4801d63
Fix issues (#1878)
waqahmed-amd-fi Jan 21, 2026
d952829
optimization to loads x_tile once, reducing memory bandwidth
waqahmed-amd-fi Jan 21, 2026
30741ca
Update mHC implementation to apply Sinkhorn-Knopp (Equation 19) to ma…
waqahmed-amd-fi Jan 21, 2026
686711f
Refactor mHC implementation to separate projection (phi) matrices int…
waqahmed-amd-fi Jan 21, 2026
831572b
Enhance mHC fused kernel to implement stream-aware processing
anhminhnguyenhoang Jan 22, 2026
ad198b3
Refactor mHC implementation
anhminhnguyenhoang Jan 22, 2026
05810eb
Adjust tolerance levels in mHC tests based on input size to improve a…
anhminhnguyenhoang Jan 22, 2026
7d333a8
Add benchmark scripts for mHC kernel performance evaluation
waqahmed-amd-fi Jan 22, 2026
46d023a
add modes to bench
waqahmed-amd-fi Jan 23, 2026
e8f4464
- Add naive configs for fused mHC and Sinkhorn-Knopp kernels
anhminhnguyenhoang Jan 23, 2026
68a2df8
switch to using exp2/log2 for sinkhorn-knopp for optimization
anhminhnguyenhoang Jan 23, 2026
df24377
Sort benchmark configurations by hidden dimension and refine FLOPs ca…
waqahmed-amd-fi Jan 26, 2026
a4a1793
Refactor Sinkhorn-Knopp kernel to support batch processing
anhminhnguyenhoang Jan 26, 2026
23609ed
better tuned configs
anhminhnguyenhoang Jan 26, 2026
514f3f5
Refactor mHC fused kernel for improved arithmetic operations and clar…
waqahmed-amd-fi Jan 27, 2026
10b122a
Add split-K support to mHC kernel by new split and reduce kernels to …
anhminhnguyenhoang Jan 27, 2026
9f6aba3
add better config with split reduce usage
anhminhnguyenhoang Jan 27, 2026
833c47a
Apply optim in mhc_fused to split reduce kernels, rename functions fo…
anhminhnguyenhoang Jan 27, 2026
70490ad
Add json config loading
anhminhnguyenhoang Jan 28, 2026
b01c00a
Add tuned JSON configuration files for fused mhc kernels
anhminhnguyenhoang Jan 28, 2026
42c944d
inittial implementation of zero-iteration Sinkhorn-Knopp (mHC-Lite). …
waqahmed-amd-fi Jan 28, 2026
190122e
optimized zero-iteration Sinkhorn-Knopp (mHC-Lite)
waqahmed-amd-fi Jan 29, 2026
507d95a
Removed Unused Projection Code (Wrapper Function)
waqahmed-amd-fi Jan 29, 2026
c2ccafc
add config loading bug fix due to caching and better tuned configs
anhminhnguyenhoang Jan 29, 2026
7862631
2D grid parallelization. Key improvements:
waqahmed-amd-fi Jan 29, 2026
cd84075
remove _sinkhorn_knopp_lite, and implement mHC_lite i.e., non-iterati…
waqahmed-amd-fi Jan 29, 2026
3e85b9d
add config loading bug fix due to caching and better tuned configs
anhminhnguyenhoang Jan 29, 2026
5374eb6
add mhc-lite
anhminhnguyenhoang Jan 30, 2026
4f4c272
revised mHC and mHC-Lite description for clarity
waqahmed-amd-fi Jan 30, 2026
57c75ac
update comments and replace if-else with assert check
waqahmed-amd-fi Jan 30, 2026
68c76b0
revised _mhc_lite_fused_split_kernel kernel
waqahmed-amd-fi Jan 30, 2026
604426d
revised _mhc_lite_fused_reduce_kernel
waqahmed-amd-fi Jan 30, 2026
e52ebdc
add mhc-lite bench mode
anhminhnguyenhoang Jan 30, 2026
e70f3dd
integrate mhc-lite into mhc_fused
anhminhnguyenhoang Jan 30, 2026
41b4908
update config loading for mode
anhminhnguyenhoang Jan 30, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion 3rdparty/composable_kernel
Submodule composable_kernel updated 76 files
+6 −6 .github/CODEOWNERS
+13 −3 CMakeLists.txt
+6 −34 Jenkinsfile
+16 −0 README.md
+30 −0 example/CMakeLists.txt
+1 −1 example/ck_tile/01_fmha/codegen/ops/fmha_batch_prefill.py
+12 −3 example/ck_tile/38_block_scale_gemm/CMakeLists.txt
+6 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_bf8.cpp
+6 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_bf8i4.cpp
+6 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_fp8.cpp
+6 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_fp8i4.cpp
+0 −222 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshuffleb.cpp
+53 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshuffleb_bf8.cpp
+57 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshuffleb_bf8i4.cpp
+53 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshuffleb_fp8.cpp
+57 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshuffleb_fp8i4.cpp
+0 −62 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshuffleb_preshufflequant.cpp
+50 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshuffleb_preshufflequant_bf8.cpp
+52 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshuffleb_preshufflequant_bf8i4.cpp
+50 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshuffleb_preshufflequant_fp8.cpp
+52 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshuffleb_preshufflequant_fp8i4.cpp
+0 −270 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshufflequant.cpp
+55 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshufflequant_bf8.cpp
+59 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshufflequant_bf8i4.cpp
+55 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshufflequant_fp8.cpp
+59 −0 example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_preshufflequant_fp8i4.cpp
+33 −6 example/ck_tile/38_block_scale_gemm/gemm_quant.cpp
+35 −26 experimental/builder/include/ck_tile/builder/reflect/conv_describe.hpp
+86 −641 experimental/builder/include/ck_tile/builder/reflect/conv_traits.hpp
+84 −0 ...l/builder/include/ck_tile/builder/reflect/conv_traits_device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp
+84 −0 ...uilder/include/ck_tile/builder/reflect/conv_traits_device_grouped_conv_fwd_multiple_abd_xdl_cshuffle_v3.hpp
+84 −0 ...nclude/ck_tile/builder/reflect/conv_traits_device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp
+739 −0 experimental/builder/include/ck_tile/builder/reflect/conv_traits_helpers.hpp
+8 −0 experimental/builder/include/ck_tile/builder/reflect/instance_to_conv_traits.hpp
+8 −0 ...ilder/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp
+8 −0 ...er/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_fwd_multiple_abd_xdl_cshuffle_v3.hpp
+8 −0 ...de/ck_tile/builder/reflect/instance_traits_device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp
+2 −1 experimental/builder/test/CMakeLists.txt
+78 −78 experimental/builder/test/conv/ck/test_conv_traits.cpp
+0 −1,127 experimental/builder/test/conv/ck/unit_instance_to_conv_traits.cpp
+800 −0 experimental/builder/test/conv/ck/unit_instance_to_conv_traits_features.cpp
+262 −0 experimental/builder/test/conv/ck/unit_instance_to_conv_traits_instances.cpp
+175 −63 include/ck/library/utility/gpu_verification.hpp
+4 −0 include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_wmma_cshuffle_v3.hpp
+8 −0 include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp
+4 −0 include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_bwd_weight_two_stage_wmma_cshuffle_v3.hpp
+4 −0 include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_bwd_weight_wmma_cshuffle_v3.hpp
+1 −1 include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp
+1 −1 include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp
+59 −17 include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp
+2 −10 include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_problem.hpp
+17 −1 include/ck_tile/ops/gemm_quant/block/block_universal_gemm_as_bs_bquant_cr.hpp
+50 −27 include/ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp
+5 −4 include/ck_tile/ops/gemm_quant/pipeline/gemm_bquant_pipeline_ag_bg_cr_base.hpp
+3 −1 include/ck_tile/ops/gemm_quant/pipeline/gemm_bquant_pipeline_ag_bg_cr_policy.hpp
+3 −1 include/ck_tile/ops/gemm_quant/pipeline/gemm_bquant_pipeline_ag_bg_cr_v3.hpp
+48 −26 include/ck_tile/ops/gemm_quant/pipeline/gemm_group_quant_utils.hpp
+0 −2 include/ck_tile/ops/gemm_quant/pipeline/gemm_quant_pipeline_problem.hpp
+22 −13 include/ck_tile/ops/gemm_quant/pipeline/gemm_wp_bquant_pipeline_ag_bg_cr_v2.hpp
+54 −56 profiler/include/profiler/profile_grouped_conv_bwd_data_impl.hpp
+44 −62 profiler/include/profiler/profile_grouped_conv_bwd_weight_impl.hpp
+44 −30 profiler/include/profiler/profile_grouped_conv_fwd_impl.hpp
+254 −0 script/run-tests.ps1
+26 −0 test/ck_tile/gemm_block_scale/CMakeLists.txt
+39 −0 test/ck_tile/gemm_block_scale/test_gemm_quant_bquant_preshuffleQuant_decode_1d.cpp
+54 −0 test/ck_tile/gemm_block_scale/test_gemm_quant_bquant_preshuffleQuant_decode_2d.cpp
+41 −0 test/ck_tile/gemm_block_scale/test_gemm_quant_bquant_preshuffleQuant_prefill_1d.cpp
+63 −0 test/ck_tile/gemm_block_scale/test_gemm_quant_bquant_preshuffleQuant_prefill_2d.cpp
+8 −5 test/ck_tile/gemm_block_scale/test_gemm_quant_bquant_preshuffle_decode_2d.cpp
+10 −5 test/ck_tile/gemm_block_scale/test_gemm_quant_bquant_preshuffle_prefill_2d.cpp
+22 −29 test/ck_tile/gemm_block_scale/test_gemm_quant_fixtures.hpp
+4 −3 test/ck_tile/gemm_streamk/CMakeLists.txt
+1 −1 test/gpu_verification/test_gpu_verification.cpp
+21 −17 test/grouped_convnd_bwd_weight/test_grouped_convnd_bwd_weight.cpp
+4 −4 test/grouped_convnd_bwd_weight/test_grouped_convnd_bwd_weight_bilinear.cpp
+4 −4 test/grouped_convnd_bwd_weight/test_grouped_convnd_bwd_weight_scale.cpp
20 changes: 20 additions & 0 deletions aiter/ops/triton/_triton_kernels/fusions/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
# SPDX-License-Identifier: MIT
# Copyright (C) 2024-2026, Advanced Micro Devices, Inc. All rights reserved.

from aiter.ops.triton._triton_kernels.fusions.mhc import (
_mhc_fused_kernel,
_mhc_fused_split_kernel,
_mhc_fused_reduce_kernel,
_mhc_lite_fused_split_kernel,
_mhc_lite_fused_reduce_kernel,
_sinkhorn_knopp_log_domain_kernel,
)

__all__ = [
"_mhc_fused_kernel",
"_mhc_fused_split_kernel",
"_mhc_fused_reduce_kernel",
"_mhc_lite_fused_split_kernel",
"_mhc_lite_fused_reduce_kernel",
"_sinkhorn_knopp_log_domain_kernel",
]
Loading