
[Pytorch] Add Cutlass GroupGEMM Support for fine-grained MoE Model #2045


Open · cassiewilliam wants to merge 14 commits into main from feature/cutlass_group_gemm_support

Conversation

@cassiewilliam commented Aug 8, 2025

Description

Add CUTLASS Grouped GEMM support for H100 (SM90), which offers a clear performance advantage over the current multi-stream cuBLAS implementation in fine-grained MoE models. Currently, this PR only supports the FP16 and BF16 cases; FP8 support is not yet available. The implementation is limited to the standard MoE module (bias and other related features have not been validated yet). Please take note.

Initial performance test results are listed below; the testing method can be found in tests/pytorch/test_group_gemm.py.

Run the test script with:

python tests/pytorch/test_group_gemm.py

| Shape (g, m, n, k) | TE V2.2 (TFLOPs) | Cutlass-Opt-V1 (TFLOPs) | Speed-Up |
|---|---|---|---|
| (8, 4096, 768, 2048) | 508.77 | 568.63 | 11.77% |
| (16, 2048, 768, 2048) | 398.81 | 534.75 | 34.08% |

The environment variable NVTE_USE_CUTLASS_GROUPGEMM toggles between the two GEMM implementations: setting export NVTE_USE_CUTLASS_GROUPGEMM=0 selects the original multi-stream cuBLAS GEMM, while export NVTE_USE_CUTLASS_GROUPGEMM=1 enables the newly added CUTLASS Grouped GEMM. The default value is 0.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@cassiewilliam cassiewilliam force-pushed the feature/cutlass_group_gemm_support branch 2 times, most recently from d2a9a55 to b42385d Compare August 8, 2025 09:14
@phu0ngng (Collaborator) commented Aug 11, 2025

Hi @cassiewilliam ,

Thank you for a great PR - it’s good to see such a clear performance improvement!

I have one suggestion - I think we should refactor the change slightly to minimize modifications in the TE framework extensions.

Currently, we have two separate C APIs: nvte_multi_stream_cublas_gemm and nvte_cutlass_grouped_gemm. The PyTorch extensions call these individually, and we would need to do the same on the JAX side. Since they share the same function signature, we could unify them into a single API - nvte_multi_tensor_gemm - and deprecate nvte_multi_stream_cublas_gemm.

Within nvte_multi_tensor_gemm, we can determine the GPU architecture and enable CUTLASS GroupedGEMM for FP16/BF16 on Hopper. This way, future changes to the GroupedGEMM implementation or backend would not require modifications to the PyTorch/JAX extensions.
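
For illustration, a minimal sketch of the dispatch described above, assuming a simplified signature (the real nvte_multi_tensor_gemm takes the full NVTE tensor/workspace/stream argument list, and the dtype check is abstracted into a boolean here):

```cpp
// Sketch only: simplified signature and placeholder dtype flag.
#include <cstdlib>
#include <cuda_runtime.h>

static bool use_cutlass_grouped_gemm(int sm_arch, bool is_fp16_or_bf16) {
  // Env toggle from the PR description; default is 0 (multi-stream cuBLAS).
  const char* env = std::getenv("NVTE_USE_CUTLASS_GROUPGEMM");
  const bool enabled = (env != nullptr && env[0] == '1');
  // The CUTLASS Grouped GEMM path is only wired up for FP16/BF16 on Hopper (SM90).
  return enabled && sm_arch == 90 && is_fp16_or_bf16;
}

void nvte_multi_tensor_gemm(/* A, B, D, workspace, ..., */
                            bool inputs_are_fp16_or_bf16, cudaStream_t stream) {
  (void)stream;  // the real implementation launches the chosen backend on this stream

  int device = 0, major = 0, minor = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
  cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
  const int sm_arch = major * 10 + minor;

  if (use_cutlass_grouped_gemm(sm_arch, inputs_are_fp16_or_bf16)) {
    // nvte_cutlass_grouped_gemm(...);        // CUTLASS Grouped GEMM backend
  } else {
    // nvte_multi_stream_cublas_gemm(...);    // existing multi-stream cuBLAS backend
  }
}
```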

@cassiewilliam (Author)

> Hi @cassiewilliam ,
>
> Thank you for a great PR - it’s good to see such a clear performance improvement!
>
> I have one suggestion - I think we should refactor the change slightly to minimize modifications in the TE framework extensions.
>
> Currently, we have two separate C APIs: nvte_multi_stream_cublas_gemm and nvte_cutlass_grouped_gemm. The PyTorch extensions call these individually, and we would need to do the same on the JAX side. Since they share the same function signature, we could unify them into a single API - nvte_multi_tensor_gemm - and deprecate nvte_multi_stream_cublas_gemm.
>
> Within nvte_multi_tensor_gemm, we can determine the GPU architecture and enable CUTLASS GroupedGEMM for FP16/BF16 on Hopper. This way, future changes to the GroupedGEMM implementation or backend would not require modifications to the PyTorch/JAX extensions.

I fully agree with your suggestion — keeping the code architecture clean is very important. Will you be handling the refactor on your side, or should I go ahead and make the changes directly in the current PR?

@yaox12 (Member) commented Aug 12, 2025

Agree with @phu0ngng. We could unify the API and do the dispatch (based on GPU arch/data type/env variable) on the TE/common side.

> Will you be handling the refactor on your side, or should I go ahead and make the changes directly in the current PR?

Please go ahead in this PR.

@cassiewilliam (Author)

> Agree with @phu0ngng. We could unify the API and do the dispatch (based on GPU arch/data type/env variable) on the TE/common side.
>
> > Will you be handling the refactor on your side, or should I go ahead and make the changes directly in the current PR?
>
> Please go ahead in this PR.

Got it — I’ll refactor the code to meet the requirements described above.

@cassiewilliam cassiewilliam force-pushed the feature/cutlass_group_gemm_support branch 12 times, most recently from 6f01bc8 to e832972 Compare August 13, 2025 04:24
@cassiewilliam (Author)

Hello @phu0ngng @yaox12, the nvte_multi_tensor_gemm interface has been fully refactored. Please review the implementation for correctness and compliance with the updated design.

@cassiewilliam cassiewilliam force-pushed the feature/cutlass_group_gemm_support branch 7 times, most recently from a023c5f to a76e1cd Compare August 18, 2025 03:58
@cassiewilliam cassiewilliam force-pushed the feature/cutlass_group_gemm_support branch 3 times, most recently from 76c1e5e to e4b5e60 Compare August 18, 2025 04:08
@yaox12 (Member) commented Aug 18, 2025

To pass the DCO check, you have to sign your commits. See here for more details.

@cassiewilliam cassiewilliam force-pushed the feature/cutlass_group_gemm_support branch from 75d075a to eb3f462 Compare August 18, 2025 06:43
@cassiewilliam (Author)

> To pass the DCO check, you have to sign your commits. See here for more details.

Appreciate the feedback — the changes have been made.

@yaox12 (Member) commented Aug 18, 2025

Please add a test here: https://github.com/NVIDIA/TransformerEngine/blob/main/tests/pytorch/test_numerics.py#L2520.
I think you can add an argument use_cutlass_group_gemm and set the environment variable if it's true (remember to unset it at the end of the test). I don't expect the CUTLASS implementation to be a bit-wise match with the baseline, so you need to set tolerances.

@cassiewilliam cassiewilliam force-pushed the feature/cutlass_group_gemm_support branch from c3010d9 to 15d750d Compare August 18, 2025 11:44
@yaox12 (Member) commented Aug 19, 2025

@cassiewilliam I enabled some tests in e928062. Please make sure the following tests pass.

pytest -v -s tests/pytorch/test_numerics.py::test_grouped_linear_accuracy_cutlass
pytest -v -s tests/pytorch/test_numerics.py::test_grouped_gemm

I tested it locally and found that tests with fuse_wgrad_accumulation = True fail. When fuse_wgrad_accumulation is enabled, the output is usually in FP32, but you're using the same data type for C as for A/B. This may explain the failures in test_grouped_linear_accuracy_cutlass. In test_grouped_gemm, however, the output has the same dtype as the inputs, so I'm not sure why those fail. Maybe the tolerances are not set correctly.

struct GemmGivenSchedule {
using ElementA = typename ScheduleConfig::DataType; // Element type for A matrix operand
using ElementB = typename ScheduleConfig::DataType; // Element type for B matrix operand
using ElementC = typename ScheduleConfig::DataType; // Element type for C and D matrix operands
Review comment (Member):

When enabling fuse_wgrad_accumulation, C/D could have a different data type from A/B.
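
For illustration only, one way the type aliases could be decoupled (GemmGivenScheduleSketch and ScheduleConfig::OutputType are hypothetical names, not the PR's actual fix):

```cpp
// Sketch: decouple the C/D element type from the A/B element type so that
// fuse_wgrad_accumulation can write FP32 outputs from FP16/BF16 inputs.
// "OutputType" is a hypothetical extra member of ScheduleConfig that would
// default to DataType when no FP32 accumulation is requested.
template <typename ScheduleConfig>
struct GemmGivenScheduleSketch {
  using ElementA = typename ScheduleConfig::DataType;    // A matrix operand
  using ElementB = typename ScheduleConfig::DataType;    // B matrix operand
  using ElementC = typename ScheduleConfig::OutputType;  // C/D operands (e.g. float)
};
```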

@cassiewilliam (Author):

Thank you for taking the time to review this. I’ll make the necessary changes and push the fixes as soon as possible.

@cassiewilliam (Author)

> @cassiewilliam I enabled some tests in e928062. Please make sure the following tests pass.
>
> pytest -v -s tests/pytorch/test_numerics.py::test_grouped_linear_accuracy_cutlass
> pytest -v -s tests/pytorch/test_numerics.py::test_grouped_gemm
>
> I tested it locally and found that tests with fuse_wgrad_accumulation = True fail. When fuse_wgrad_accumulation is enabled, the output is usually in FP32, but you're using the same data type for C as for A/B. This may explain the failures in test_grouped_linear_accuracy_cutlass. In test_grouped_gemm, however, the output has the same dtype as the inputs, so I'm not sure why those fail. Maybe the tolerances are not set correctly.

Thank you for taking the time to review this. I’ll make the necessary changes and push the fixes as soon as possible.

@zhongbozhu (Collaborator)

Great PR! Thank you so much. CC @timmoon10

NVTE_CHECK(false, "Failed to run CUTLASS Grouped GEMM");
}

std::free(host_workspace);
Review comment:

Nice work! I'd like to point out that this free can be dangerous if the kernel launch on the host returns much faster than kernel execution on the GPU, because I don't think gemm.run implies stream synchronization. In that case, host_workspace may already have been destroyed before cudaMemcpyAsync has finished copying it to all_workspace.

So I suggest adding a CUDA stream synchronization before the free. Alternatively, you could maintain the ptrs and shapes in device memory in a different way.
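
For illustration, a minimal sketch of the suggested synchronize-before-free, continuing the snippet above (host_workspace, stream, and the NVTE_CHECK pattern come from the PR's code; the exact placement is an assumption):

```cpp
// Sketch: make sure the device has consumed host_workspace before freeing it.
// host_workspace was staged to device memory with cudaMemcpyAsync on `stream`,
// and the grouped GEMM was launched on the same stream, so one synchronize
// covers both operations.
cudaError_t err = cudaStreamSynchronize(stream);
NVTE_CHECK(err == cudaSuccess, "Failed to synchronize stream before freeing host workspace");
std::free(host_workspace);
```

Keeping the pointer/shape metadata in persistent device-side buffers, as suggested above, would avoid the blocking synchronization on the host.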

@phu0ngng (Collaborator)

/te-ci L0
