Conversation

zhongbozhu
Collaborator

Description

Motivation: #2053

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: zhongboz <[email protected]>
@zhongbozhu zhongbozhu requested a review from timmoon10 September 2, 2025 19:42
@zhongbozhu zhongbozhu self-assigned this Sep 2, 2025
Signed-off-by: zhongboz <[email protected]>
@@ -641,11 +641,15 @@ void nvte_destroy_quantization_config(NVTEQuantizationConfig config) {
}

int nvte_is_non_tn_fp8_gemm_supported() {
Collaborator

This doesn't handle the case where we have multiple GPUs with different archs. We could add an arg for the device ID, but that just pushes the CPU overhead problem somewhere else.
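For illustration, the device-ID variant mentioned above might look like the Python sketch below (the function name, the dict cache, and the compute-capability cutoff are all assumptions, not code from this PR); the per-call lookup is exactly where the CPU overhead would move:

import torch

# Hypothetical per-device cache for the arch check, so that nodes with
# mixed GPU architectures get a correct answer for each device.
_non_tn_fp8_supported = {}  # device_id -> bool

def is_non_tn_fp8_gemm_supported(device_id: int) -> bool:
    if device_id not in _non_tn_fp8_supported:
        # Assumed cutoff: compute capability 10.x and newer; the real
        # check in Transformer Engine may differ.
        major, _ = torch.cuda.get_device_capability(device_id)
        _non_tn_fp8_supported[device_id] = major >= 10
    return _non_tn_fp8_supported[device_id]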

Collaborator Author

Yeah, but we didn't really support this case anyway?

Collaborator Author

For a topology like one CPU with 4 or 8 GPUs of homogeneous arch, we can cache the TN layout check.
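As a sketch of that caching idea (assuming a homogeneous node; the function name and the capability cutoff are assumptions, not this PR's code):

import functools
import torch

@functools.lru_cache(maxsize=1)
def is_non_tn_fp8_gemm_supported() -> bool:
    # Run the arch check once per process and reuse the result, on the
    # assumption that every GPU on the node has the same architecture,
    # so device 0 stands in for all devices.
    major, _ = torch.cuda.get_device_capability(0)
    return major >= 10  # assumed cutoff; the real check may differ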

with torch.cuda.device(
    getattr(self, list(self.named_parameters())[0][0]).device
), self.prepare_forward(

if is_first_microbatch is None or is_first_microbatch:
Collaborator

How can we assume it is safe to skip setting the device when is_first_microbatch=False?

Collaborator Author

I assume that the device won't change across microbatches in a global batch.

Collaborator Author

In a CPU-bound, forward-only case, skipping the set-device call on every single forward pass can account for a ~10% perf difference.
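A minimal sketch of the pattern being discussed (names are illustrative, not this PR's code; assumes the module lives on a CUDA device and its parameters stay there for the whole global batch):

import torch

class CachedDeviceModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, inp, is_first_microbatch=None):
        # Query the parameter device only on the first microbatch and
        # reuse the cached value afterwards, saving a CPU-side device
        # lookup on every subsequent forward pass.
        if is_first_microbatch is None or is_first_microbatch:
            self._fwd_device = next(self.parameters()).device
        with torch.cuda.device(self._fwd_device):
            return self.linear(inp)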

Collaborator

This approach is really ad hoc. Personally, I think it would be better not to support the multi-device case at all (basically revert #1974) than to have inconsistent multi-device support.

Collaborator Author

I agree, but I'm not sure whether there would be any impact on customers already using this feature?

Signed-off-by: zhongboz <[email protected]>
@zhongbozhu
Collaborator Author

/te-ci pytorch L1

@zhongbozhu
Collaborator Author

/te-ci pytorch L1
