
Conversation


@timmoon10 commented on Dec 6, 2025

Description

All of the supported block-scaled tensor formats (MXFP8, NVFP4, DSv3 FP8) have two ways of ordering their scaling factors:

  • "Compact" ordering for quantization, dequantization, and communication
  • "Swizzled" ordering for GEMM

The core infrastructure handles this in an ad hoc way, blindly assuming that the "right" scale ordering is used for each operation. The PyTorch infrastructure only supports MXFP8 and NVFP4 scales in compact order, although DSv3 FP8 does distinguish between "compact" and "GEMM-ready" formats. This situation makes it hard to implement fused kernels that can bypass the swizzle kernel.

This PR adds a with_gemm_swizzled_scales field to the C++ tensor class so that the core infrastructure can distinguish between the different scale orderings. It also adds this field to the PyTorch quantized tensor classes and exposes an optimize_for_gemm option in the quantizer, so that tensors which do not need communication or checkpointing can be created directly with GEMM-swizzled scales. Finally, it rips out all of the DSv3 FP8 infrastructure for the compact format, which is no longer necessary.
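
For a sense of how this would be used from PyTorch, here is a minimal sketch. Only optimize_for_gemm and _with_gemm_swizzled_scales come from this PR; the MXFP8Quantizer constructor arguments and the quantize call are assumptions and may not match the actual API.

```python
# Minimal usage sketch (not the exact TE API): quantize a tensor so that its
# scales land directly in the GEMM-swizzled layout, skipping the standalone
# swizzle kernel. Constructor arguments and the quantize call are assumptions.
import torch
from transformer_engine.pytorch.tensor.mxfp8_tensor import MXFP8Quantizer

quantizer = MXFP8Quantizer()          # real constructor likely takes dtype/usage args
quantizer.optimize_for_gemm = True    # new option from this PR

x = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)
x_mxfp8 = quantizer(x)                # or quantizer.quantize(x); exact entry point may differ

# The tensor records its scale ordering, so downstream GEMMs can consume the
# scales as-is instead of launching a swizzle kernel first.
assert x_mxfp8._with_gemm_swizzled_scales
```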

Progress

  • MXFP8
  • DSv3 FP8
  • NVFP4
  • Add option to pre-swizzle weights
  • Pre-swizzle activations
  • Fused MXFP8 quantize + swizzle

Closes #2446 (Support MXFP8/NVFP4 tensors with pre-swizzled scales).

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Support GEMM swizzled scales in C++ tensor class
  • Support GEMM swizzled scales in PyTorch quantized tensor classes
  • Support optimize_for_gemm option in PyTorch quantizer
  • Expose PyTorch function to swizzle scales (see the sketch after this list)
  • Support MXFP8 quantization with pre-swizzled scales
  • Enable fused quantize+swizzle kernels in linear module and related
  • Remove DSv3 FP8 compact data format. It was used to avoid all-gather interleaving, which we can now fix with the swap-first-dims kernel.
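
When a tensor was produced with compact scales, the new swizzle entry point can also be called explicitly, as sketched below. The Python-side binding name (tex.inplace_swizzle_scale_for_gemm) and its signature are assumptions based on the functions added in swizzle.cpp; the real binding may differ.

```python
# Hypothetical sketch: make sure a quantized tensor's scales are in the
# GEMM-swizzled layout before a GEMM consumes them.
import transformer_engine_torch as tex  # TE's C++ extension module (name assumed)

def ensure_gemm_ready(qtensor):
    """Swizzle scales in place only if the tensor still uses compact ordering."""
    if not getattr(qtensor, "_with_gemm_swizzled_scales", False):
        # Assumed binding of the new swizzle.cpp entry point; exact name and
        # argument types on the Python side are not confirmed by this PR text.
        tex.inplace_swizzle_scale_for_gemm(qtensor)
    return qtensor
```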

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@timmoon10 force-pushed the tmoon/pre-swizzled-scales branch from d274220 to 52ce3a4 on December 6, 2025 02:53
@timmoon10 added the enhancement and refactor labels on Dec 6, 2025
@timmoon10 force-pushed the tmoon/pre-swizzled-scales branch from 4925b63 to 1de4b5e on December 10, 2025 07:19
@timmoon10
Collaborator Author

/te-ci

@timmoon10
Collaborator Author

/te-ci

@timmoon10 marked this pull request as ready for review on December 12, 2025 08:21

greptile-apps bot commented Dec 12, 2025

Greptile Overview

Greptile Summary

This PR adds explicit tracking of scale ordering formats for block-scaled tensors (MXFP8, NVFP4, DSv3 FP8), distinguishing between "compact" ordering for quantization/communication and "swizzled" ordering for GEMM operations.

Key changes:

  • Added with_gemm_swizzled_scales field to C++ Tensor class and all PyTorch quantized tensor classes (MXFP8, NVFP4, Float8Blockwise)
  • Added optimize_for_gemm option to Quantizer base class, enabling fused quantize+swizzle kernels
  • Implemented new swizzle_scales_for_gemm and multi_tensor_swizzle_scales_for_gemm functions in swizzle.cpp
  • Extended MXFP8 quantization kernel to support direct swizzled scale output via WITH_GEMM_SWIZZLED_SCALES template parameter
  • Added runtime validation in cuBLAS GEMM that scales are in expected format
  • Removed deprecated Float8BlockScaleTensorFormat::COMPACT and all_gather_usage fields
  • Refactored FP8 blockwise all-gather to use swap-first-dims kernel instead of compact format

Benefits:

  • Enables bypassing the swizzle kernel when fused kernels can write swizzled scales directly
  • Cleaner API with explicit scale format tracking vs implicit assumptions
  • Simplifies distributed communication by removing compact format complexity

Confidence Score: 5/5

  • This PR is safe to merge - it's a well-structured refactoring that adds explicit tracking for scale formats with proper validation at API boundaries.
  • The changes are comprehensive and consistent across the C++ and Python layers. The PR adds runtime validation in cuBLAS GEMM to ensure scales are in the expected format, providing a safety net. All tensor classes properly propagate the new with_gemm_swizzled_scales field through view, reshape, serialization, and copy operations. The deprecated compact format removal is clean and the new swap-first-dims approach for distributed communication is simpler.
  • No files require special attention - the implementation is consistent across all modified files.

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| transformer_engine/common/include/transformer_engine/transformer_engine.h | 5/5 | Added kNVTEWithGEMMSwizzledScales tensor parameter and nvte_set_tensor_param_v2/nvte_get_tensor_param_v2 APIs for handling non-NVTEBasicTensor parameters. Deprecated Float8BlockScaleTensorFormat::COMPACT. |
| transformer_engine/pytorch/csrc/extensions/swizzle.cpp | 5/5 | New file with swizzle_scales_for_gemm, multi_tensor_swizzle_scales_for_gemm, convert_block_scaling_to_mxfp8_tensor, and inplace_swizzle_scale_for_gemm functions for scale format conversion. |
| transformer_engine/pytorch/quantized_tensor.py | 5/5 | Added optimize_for_gemm field to Quantizer base class, enabling tensors to be created with pre-swizzled scales for GEMM optimization. |
| transformer_engine/pytorch/tensor/mxfp8_tensor.py | 5/5 | Added _with_gemm_swizzled_scales tracking throughout MXFP8Tensor class, including new, view, reshape, FSDP2 operations, serialization, and copy operations. |
| transformer_engine/pytorch/tensor/float8_blockwise_tensor.py | 5/5 | Removed all_gather_usage and data_format fields from Float8BlockQuantizer and Float8BlockwiseQTensor, simplifying the interface to always use GEMM-ready format. |
| transformer_engine/pytorch/distributed.py | 5/5 | Refactored FP8 blockwise all-gather to use an _AsyncHandle class with post-processing. Removed compact format handling; now uses the swap-first-dims kernel for the interleaving fix. |
| transformer_engine/pytorch/csrc/quantizer.cpp | 5/5 | Added optimize_for_gemm field handling. Removed all_gather_usage and compact format logic from Float8BlockQuantizer. MXFP8Quantizer now uses optimize_for_gemm to set swizzled scales. |
| transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh | 5/5 | Added WITH_GEMM_SWIZZLED_SCALES template parameter and gemm_swizzled_scale_idx function to support fused quantize+swizzle in a single kernel. |
| transformer_engine/common/gemm/cublaslt_gemm.cu | 5/5 | Added runtime checks that MXFP8 and NVFP4 scales are in GEMM-swizzled format before GEMM execution. Ensures correct format at the API boundary. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User Code
    participant Quantizer as Quantizer
    participant QTensor as QuantizedTensor
    participant Swizzle as swizzle_scales_for_gemm
    participant GEMM as cuBLAS GEMM

    User->>Quantizer: set optimize_for_gemm=True
    Quantizer->>QTensor: create_tensor(with_gemm_swizzled_scales=True)
    Note over QTensor: MXFP8 quantize kernel<br/>writes swizzled scales directly

    User->>GEMM: gemm(A, B)
    GEMM->>QTensor: check with_gemm_swizzled_scales
    alt Scales already swizzled
        GEMM->>GEMM: Use scales directly
    else Scales not swizzled
        GEMM->>Swizzle: swizzle_scales_for_gemm(tensor)
        Swizzle-->>GEMM: Return swizzled scales buffer
        Note over GEMM: Keep buffer alive during GEMM
    end
    GEMM->>GEMM: Execute cuBLAS GEMM
    GEMM-->>User: Return result
```
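
In code terms, the dispatch shown in the diagram boils down to roughly the following. This is illustrative Python pseudocode with hypothetical gemm_kernel and swizzle_fn callables, not the actual C++ implementation in cublaslt_gemm.cu.

```python
# Illustrative sketch of the GEMM-side dispatch above (hypothetical helpers).
def gemm_with_block_scales(A, B, gemm_kernel, swizzle_fn):
    """Run a block-scaled GEMM, swizzling scales on the fly only when needed."""
    temp_scale_buffers = []
    for operand in (A, B):
        if not operand._with_gemm_swizzled_scales:
            # Fall back to the standalone swizzle; keep the swizzled scale
            # buffer referenced so it stays alive while the GEMM consumes it.
            temp_scale_buffers.append(swizzle_fn(operand))
    out = gemm_kernel(A, B)  # cuBLAS requires GEMM-swizzled scales
    return out
```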


@timmoon10
Collaborator Author

/te-ci L1


@timmoon10
Collaborator Author

/te-ci L1


@timmoon10 added the performance and MoE labels on Dec 15, 2025

greptile-apps bot left a comment


65 files reviewed, no comments

