What's Changed
- Added support for math_sm_count in GEMM
- Added drop-in Triton replacements for LayerNorm and RMSNorm
- Added Triton MXFP8 quantize/dequantize
- Reduced memory occupied by the FP8 weight transpose cache
- Switched to AOTriton 0.10c
- Switched from CK to AITER
- JAX 0.7 support
- FlashAttn 2.8.0.post2 support
- Added gfx950 as default target
- Fixed building on ROCm 6.2
- Fixed faults with current scaling
Upstream release notes: https://github.com/NVIDIA/TransformerEngine/releases/tag/v2.2
Full Changelog: v2.1_rocm...v2.2_rocm