Add logic for block-scaled tensors with GEMM swizzled scales #2486
Conversation
/te-ci L1
void nvte_set_tensor_param_v2(NVTETensor tensor, NVTETensorParam param, const void *buf,
Why do we need a v2 here?
I don't want to break the existing APIs. That said, this PR isn't fully backward-compatible because the GEMM no longer secretly assumes that MXFP8 scales are swizzled.
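For readers who have not seen the pattern, the sketch below illustrates the general idea behind a versioned C API entry point: the original function keeps its exact signature and semantics, while a _v2 variant accepts a more general parameter interface. Everything in the sketch is hypothetical example code illustrating the pattern, not Transformer Engine's actual API.

```cpp
// Hypothetical illustration of versioned C API entry points.
// This is NOT Transformer Engine's actual code; all names are made up.
#include <cstddef>

extern "C" {

typedef void *ExampleTensor;     // opaque tensor handle (hypothetical)
typedef int ExampleTensorParam;  // parameter identifier (hypothetical)

// v1: the original entry point. Its signature and semantics stay frozen so
// that existing callers keep compiling and linking unchanged.
void example_set_tensor_param(ExampleTensor tensor, ExampleTensorParam param,
                              const void *data) {
  (void)tensor; (void)param; (void)data;  // original behavior elided
}

// v2: a new entry point with an extended interface (here, an explicit byte
// count so new parameter kinds such as boolean flags can be passed safely).
// Existing code never calls it; new code opts in explicitly.
void example_set_tensor_param_v2(ExampleTensor tensor, ExampleTensorParam param,
                                 const void *buf, size_t size) {
  (void)tensor; (void)param; (void)buf; (void)size;  // extended behavior elided
}

}  // extern "C"
```

As the reply above notes, versioning the setter keeps source compatibility, but it does not by itself make the change fully backward-compatible when downstream behavior (the GEMM's assumption about scale ordering) also changes.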
/te-ci L1
Description
All of the supported block-scaled tensor formats (MXFP8, NVFP4, DSv3 FP8) have two ways of ordering their scaling factors: a compact ordering and a GEMM-ready ("swizzled") ordering.
The core infrastructure handles this in an ad hoc way, blindly assuming that the "right" scale ordering is used for each operation. The PyTorch infrastructure only supports MXFP8 and NVFP4 scales in compact order, although DSv3 FP8 does distinguish between "compact" and "GEMM-ready" formats. This situation makes it hard to implement fused kernels that can bypass the swizzle kernel.
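To make the difference concrete, the sketch below compares scale-buffer sizes for MXFP8 rowwise scales under two assumptions that are mine, not taken from this PR: the compact layout stores one E8M0 scale per 1x32 block with no padding, and the GEMM-ready layout pads the scale matrix to 128x4 tiles (a cuBLAS-style block-scaling layout) before reordering. The tile permutation itself is omitted.

```cpp
// Hedged sketch: scale-buffer sizes for MXFP8 rowwise scales (one E8M0 scale
// per 1x32 block). The 128x4 padding reflects an assumed cuBLAS-style
// GEMM-ready layout; this is an illustration, not Transformer Engine code.
#include <cstddef>
#include <cstdio>

constexpr std::size_t round_up(std::size_t x, std::size_t m) {
  return (x + m - 1) / m * m;
}

// Compact layout: a plain (M, K/32) row-major matrix of scales, no padding.
std::size_t compact_scale_count(std::size_t M, std::size_t K) {
  return M * (K / 32);
}

// GEMM-ready layout: the same scales, padded to 128-row x 4-column tiles and
// reordered tile by tile (the reorder itself is elided here).
std::size_t swizzled_scale_count(std::size_t M, std::size_t K) {
  return round_up(M, 128) * round_up(K / 32, 4);
}

int main() {
  std::printf("1024x4096: compact %zu, swizzled %zu\n",
              compact_scale_count(1024, 4096), swizzled_scale_count(1024, 4096));
  std::printf("1000x4000: compact %zu, swizzled %zu\n",
              compact_scale_count(1000, 4000), swizzled_scale_count(1000, 4000));
  return 0;
}
```

The point of the example is only that the two orderings differ in both size and element order, so any consumer of the scales has to know which one it was handed.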
This PR adds a with_gemm_swizzled_scales field in the C++ tensor class so that the core infrastructure can distinguish between the different scale orderings. It also adds this field in the PyTorch quantized tensor classes, and exposes an optimize_for_gemm option in the quantizer so that we can create GEMM-ready tensors in cases where they do not need to be communicated or checkpointed. Finally, it rips out all the DSv3 FP8 infrastructure for the compact format, which is no longer necessary.
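As a rough sketch of how such a flag is meant to be used, the hypothetical code below shows a GEMM frontend branching on it so that the swizzle only runs when the scales are still in compact order. Names and structure are my own; this is not the code added by this PR.

```cpp
// Hypothetical sketch: dispatch on a per-tensor "scales already swizzled"
// flag before a block-scaled GEMM. Not Transformer Engine's actual code.
struct BlockScaledTensor {
  // data and scale pointers elided
  bool with_gemm_swizzled_scales = false;  // true if scales are already GEMM-ready
};

void swizzle_scales_for_gemm(BlockScaledTensor &t) {
  // Reorder the compact scales into the layout the GEMM expects (elided).
  t.with_gemm_swizzled_scales = true;
}

void block_scaled_gemm(BlockScaledTensor &A, BlockScaledTensor &B /*, outputs */) {
  // Instead of silently assuming swizzled scales, the GEMM checks the flag and
  // swizzles on demand; producers that already emit GEMM-ready scales (e.g. a
  // quantizer with an optimize_for_gemm-style option) skip the extra kernel.
  if (!A.with_gemm_swizzled_scales) swizzle_scales_for_gemm(A);
  if (!B.with_gemm_swizzled_scales) swizzle_scales_for_gemm(B);
  // ... launch the GEMM ...
}
```

In this reading, the optimize_for_gemm quantizer option amounts to producing tensors with the flag already set, so the branch above becomes a no-op on the hot path.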
Progress
Add option to pre-swizzle weights
Closes #2446.
Type of change
Changes
Please list the changes introduced in this PR:
optimize_for_gemm option in PyTorch quantizer
Checklist: