CuTe DSL

New features

cute.experimental: introduces a higher-level, composable layer on top of the existing CuTe DSL APIs (not a separate abstraction); it can be mixed freely with existing CuTe DSL building blocks.
Fragment-free programming model: copy/dot APIs take memrefs directly instead of descriptors/fragments.
Automatic TMA descriptor generation and update insertion.
Automatic vectorization and predication for SIMT copies.
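The decision the DSL automates here can be modeled in a few lines. The sketch below is plain Python, not the DSL implementation: it picks the widest vector access that the pointer alignment allows and reports whether a predicated tail is needed. Function name and defaults are illustrative.

```python
# Illustrative model (plain Python, not the DSL implementation) of how a
# SIMT copy can be auto-vectorized: pick the widest vector access that the
# pointer alignment allows, and predicate any remainder elements.

def plan_simt_copy(num_elems: int, elem_bytes: int, align_bytes: int,
                   max_vector_bytes: int = 16):
    """Return (vector_width_in_elems, num_full_vectors, tail_elems)."""
    width = max_vector_bytes // elem_bytes
    # Shrink the vector until its byte size divides the alignment.
    while width > 1 and (align_bytes % (width * elem_bytes)) != 0:
        width //= 2
    full, tail = divmod(num_elems, width)
    return width, full, tail  # tail > 0 means a predicated epilogue is needed

# 1000 fp16 elements from a 16B-aligned pointer: 8-wide vectors, 125 of them.
print(plan_simt_copy(1000, 2, 16))  # -> (8, 125, 0)
# A 4B-aligned source forces narrower vectors and leaves a predicated tail.
print(plan_simt_copy(1001, 2, 4))   # -> (2, 500, 1)
```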
New pipeline abstraction with convenience wrappers
New Partition ops to simplify partitioning logic.
Device-side TMA descriptor allocation, initialization, and management
cutlass.version for a string representation of the DSL version.
cutlass.CUDA_VERSION for a version class that reports the CUDA version used by the DSL.
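The exact interface of cutlass.CUDA_VERSION is not spelled out above; the sketch below is a hypothetical, pure-Python stand-in (class and method names are illustrative, not the actual cutlass API) showing the kind of ordered comparison such a version class enables, so code can gate features by toolkit version.

```python
# Hypothetical sketch (names are illustrative, NOT the actual cutlass API)
# of a CUDA-version class: ordered comparison against (major, minor)
# tuples lets callers gate features by toolkit version.
from functools import total_ordering

@total_ordering
class CudaVersion:
    def __init__(self, major: int, minor: int):
        self.major, self.minor = major, minor

    def _key(self):
        return (self.major, self.minor)

    @staticmethod
    def _cast(v):
        # Accept either another CudaVersion or a (major, minor) tuple.
        return v if isinstance(v, CudaVersion) else CudaVersion(*v)

    def __eq__(self, other):
        return self._key() == self._cast(other)._key()

    def __lt__(self, other):
        return self._key() < self._cast(other)._key()

    def __repr__(self):
        return f"{self.major}.{self.minor}"

ctk = CudaVersion(13, 1)
print(ctk >= (13, 0))  # -> True: e.g. safe to rely on CTK 13.x behavior
```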
Added CopyDsmemStoreOp to store data to distributed shared memory with explicit synchronization.
Grouped GEMM example now supports device-only problem shapes.
Grid carve-out is now allowed even when problem shapes are not available on the host.
TMA+LdMatrix features for loading and unpacking narrow-width types (see mixed_input_fmha_decode.py for example usage).
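For narrow-width types such as int4, the unpack step amounts to splitting signed nibbles out of packed bytes. The sketch below is a plain-Python reference for that arithmetic, not the DSL's in-register path; the function name is illustrative.

```python
# Plain-Python reference (not the DSL's in-register path) for the unpack
# step on narrow-width types: expand packed signed int4 nibbles out of
# bytes, as needed for an int4 KV cache.

def unpack_int4(packed: bytes):
    """Expand each byte into two signed 4-bit values (low nibble first)."""
    out = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            out.append(nib - 16 if nib >= 8 else nib)  # sign-extend 4 bits
    return out

print(unpack_int4(bytes([0x21, 0xF8])))  # -> [1, 2, -8, -1]
```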
Customized epilogue fusion for persistent dense GEMM is now possible through a Python Epilogue Fusion Configuration (EFC) function, somewhat similar to CUTLASS C++ EVT. A PyTorch evaluator is also provided to compare results.
More examples of authoring peak-performance kernels:
Mixed input FMHA decode example with support for int4 KV (int8 KV supported in 4.3)
A new acc_scale grouped mixed-input GEMM kernel variant is introduced to deliver better performance for decoding cases.
All mixed-input GEMM examples are moved into a separate folder, mixed_input_gemm. Common utility functions are also extracted into mixed_input_host_utils.py in the same folder.
Bug fixes and improvements
Fixed an issue where both branches of an if were executed.
API changes

Deprecate get_num_tmem_alloc_cols from blackwell_helpers.py. Use the one from tmem_allocator.py instead.
Deprecate SM100_TMEM_CAPACITY_COLUMNS and SM100_TMEM_MIN_ALLOC_COLUMNS.
LdMatrix16x16x8bOp and StMatrix16x8x8bOp now require explicit transpose=True when calling init, to avoid ambiguity in data transposition.
LdMatrix16x16x8bOp copy traits are updated to be faithful to PTX, without permutations. The permuted variant is renamed to LdMatrix16x8x8bOp.
The grouped GEMM example takes the argument --host_problem_shape_available. If it is provided, the grid is carved out based on the host problem shapes; otherwise, the maximum possible number of SMs is launched.
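The carve-out policy just described can be sketched in plain Python (this is an illustrative model, not the example's actual code; names and the tile size are assumptions): with host problem shapes, size the persistent grid to the output tiles actually present; without them, occupy every SM.

```python
# Illustrative model (not the actual grouped GEMM example code) of the
# grid carve-out policy: with host problem shapes, launch only as many
# CTAs as there are output tiles; otherwise launch the maximum SM count.

def plan_grid(num_sms: int, host_problem_shapes=None, tile_mn=(128, 128)):
    if host_problem_shapes is None:
        return num_sms  # shapes only known on device: occupy every SM
    tiles = 0
    for m, n, _k in host_problem_shapes:
        tiles += -(-m // tile_mn[0]) * -(-n // tile_mn[1])  # ceil-div
    return min(tiles, num_sms)

# Shapes available on host: only as many CTAs as output tiles (2 here).
print(plan_grid(132, [(256, 128, 64)]))  # -> 2
# Shapes unavailable on host: maximum possible SMs.
print(plan_grid(132))                    # -> 132
```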
hardware_info.get_max_active_cluster now supports passing in a specific stream to query. This is useful for green-context-based SM partitioning.
group_bulk_copy_modes in the async bulk copy example is now deprecated; use group_modes directly instead.
Deprecate nvvm enum arguments in the nvvm wrappers; use str instead.
cute.arch.calc_packed_f32x2_op changed from FTZ enabled by default to FTZ disabled by default.
In the CuTe DSL with CTK 13.1, the following APIs in cutlass.cute.arch now require a string literal instead of an enum as argument:
fence_proxy
fence_view_async_tmem_op
calc_packed_f32x2_op
warp_redux_sync
atomic_add
atomic_and
atomic_or
atomic_xor
atomic_max
atomic_min
atomic_exch
atomic_cas
store
load
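Call sites migrating to the string-literal arguments above can keep a small shim during the transition. The sketch below is plain Python; the enum and string values are illustrative placeholders, not the actual cutlass.cute.arch names.

```python
# Sketch of a migration shim for the enum-to-string change above. The
# enum class and string values are illustrative placeholders, NOT the
# actual cutlass.cute.arch names.
import enum

class ProxyKind(enum.Enum):          # stand-in for an old enum argument
    async_shared = "async.shared"
    alias = "alias"

def to_arg(kind) -> str:
    """Accept an old enum or a new string literal; return the string."""
    return kind.value if isinstance(kind, enum.Enum) else kind

# Old call sites keep working while new ones pass the literal directly:
print(to_arg(ProxyKind.async_shared))  # -> async.shared
print(to_arg("async.shared"))          # -> async.shared
```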
Use an 'Advanced control file' for the mixed input GEMM examples for better performance.
The advanced control file is an experimental feature of the CUDA compiler. It contains internal compiler settings tuned for specific kernels with a specific version of the CUDA toolkit, to produce better GPU kernel code. More details and documentation on how to create these control files will be provided in a future CUDA toolkit release. Note: an advanced control file is not expected to work for kernels it was not tuned for. There is no compatibility guarantee, and a control file will not work with a different version of the CUDA toolkit.
CUTLASS C++
Add example 93 for Blackwell low latency generation phase GQA kernel.
Add Blackwell SM100 State Space Decomposition (SSD) kernel in example 112.
Add Hopper SM90 State Space Decomposition (SSD) kernel in example 111.
Add example 94 for Ada FP8xFP8 -> BF16 GEMM with blockwise dequantization of input matrices in the MMA loop with FP32 accumulation.
Generate additional device/kernel/threadblock files in the CUTLASS include directory that add functionality to carry the scaling tensors and use them in the MMA loop.
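The blockwise dequantization in example 94 can be stated as a small reference computation: the low-precision products for each K-block are accumulated, then rescaled by that block's A and B scale factors before being added into the FP32 accumulator. The sketch below is a plain-Python reference (no CUTLASS); the function name, block size, and shapes are illustrative.

```python
# Plain-Python reference (no CUTLASS) for a GEMM with blockwise
# dequantization inside the MMA loop: quantized A/B products are
# accumulated per K-block, then rescaled by that block's scale factors
# before being added into the float accumulator.

def gemm_blockwise_dequant(Aq, Bq, scale_a, scale_b, k_block):
    M, K, N = len(Aq), len(Aq[0]), len(Bq[0])
    C = [[0.0] * N for _ in range(M)]
    for kb in range(0, K, k_block):              # one scale pair per K-block
        for i in range(M):
            sa = scale_a[i][kb // k_block]       # per-row, per-block A scale
            for j in range(N):
                sb = scale_b[kb // k_block][j]   # per-block, per-col B scale
                acc = 0.0
                for k in range(kb, min(kb + k_block, K)):
                    acc += Aq[i][k] * Bq[k][j]   # low-precision product
                C[i][j] += acc * sa * sb         # dequantize per block
    return C

Aq = [[1, 2, 3, 4]]            # 1x4, two K-blocks of 2
Bq = [[1], [1], [1], [1]]      # 4x1
print(gemm_blockwise_dequant(Aq, Bq, [[0.5, 2.0]], [[1.0], [1.0]], 2))
# -> [[15.5]]  (0.5*(1+2) + 2.0*(3+4))
```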
compute_memory_reordering_atom<tfloat32_t>() updated.
The TmaGbasis parameter of AuxTmaParams (AuxTmaParams.tma_gbasis) changed; users are allowed to manually construct a dynamic gbasis.
Documentation moved under media/docs.

This discussion was created from the release CUTLASS 4.4.0.