- Many performance optimizations, improvements, and bug fixes, including fixes for FlashAttention-2.
- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
NOTICE:
- The upcoming CUTLASS 3.6 release will include a breaking refactor of the CUTLASS 3.x convolution kernel::ConvUniversal API to bring it in line with gemm::GemmUniversal. After this, the 3.x convolution API will no longer be considered a beta API.
- The upcoming CUTLASS 3.6 release will include a breaking refactor of the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.
This discussion was created from the release CUTLASS 3.5.1.
- cache_hints in TMA copy atoms (example 48).
- GemmSparseUniversal API for CUTLASS 2.x Ampere kernels to enable serial and parallel split-k for sparse tensor cores, and new tiny tile sizes to better support LLM inference.
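For readers unfamiliar with split-k: the K dimension of C = A * B is partitioned into slices, each slice computes a partial product, and the partials are then combined, either serialized through the output (serial split-k) or written to separate workspace buffers and reduced in a second pass (parallel split-k). The following is a minimal CPU-side C++ sketch of the parallel variant to illustrate the idea only; it is not the GemmSparseUniversal API or any CUTLASS code, and all names in it are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of parallel split-k (not the CUTLASS API):
// K is split into `splits` slices, each slice accumulates a partial
// M x N product into its own workspace, and a final pass reduces them.
std::vector<float> splitk_gemm(const std::vector<float>& A,  // M x K, row-major
                               const std::vector<float>& B,  // K x N, row-major
                               std::size_t M, std::size_t N, std::size_t K,
                               std::size_t splits) {
    // One partial M x N product per split (the "parallel" workspace).
    std::vector<std::vector<float>> partial(splits,
                                            std::vector<float>(M * N, 0.0f));
    std::size_t chunk = (K + splits - 1) / splits;
    for (std::size_t s = 0; s < splits; ++s) {
        std::size_t k0 = s * chunk;
        std::size_t k1 = (k0 + chunk < K) ? k0 + chunk : K;
        for (std::size_t i = 0; i < M; ++i)
            for (std::size_t k = k0; k < k1; ++k)
                for (std::size_t j = 0; j < N; ++j)
                    partial[s][i * N + j] += A[i * K + k] * B[k * N + j];
    }
    // Reduction pass: sum the partial products into the final C.
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t s = 0; s < splits; ++s)
        for (std::size_t e = 0; e < M * N; ++e)
            C[e] += partial[s][e];
    return C;
}
```

On a GPU the per-split work runs in independent threadblocks, which is why split-k helps occupancy on small-M/N, large-K problems such as LLM inference shapes; the serial variant instead serializes the accumulation into C with a semaphore, avoiding the extra workspace.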