Test fp4: Lluo/fp4 try out #3521

lanluo-nvidia · 2025-05-15T17:28:03Z

github-actions

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/conversion/impl/nvfp4_quantize.py	2025-05-15 17:28:16.606815+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/conversion/impl/nvfp4_quantize.py	2025-05-15 17:28:40.517973+00:00
@@ -140,12 +140,11 @@
    return dequantized_data


# TODO: to remove it this is to make sure our global scale and block scale calculation is correct during debugging
def _test_weights_scaling_factor(
-    weights_tensor: torch.Tensor, 
-    global_scale: torch.Tensor
+    weights_tensor: torch.Tensor, global_scale: torch.Tensor
) -> None:

    import modelopt.core.torch.quantization.qtensor.nvfp4_tensor as nvfp4_tensor
    import modelopt.onnx.quantization.quant_utils as quant_utils

@@ -192,11 +191,13 @@
    """

    import modelopt.core.torch.quantization.qtensor.nvfp4_tensor as nvfp4_tensor

    block_scale_fp8 = nvfp4_tensor.NVFP4QTensor.get_weights_scaling_factor(
-        weights_tensor, 16, global_scale,
+        weights_tensor,
+        16,
+        global_scale,
    )[0]

    weights_tensor_scaled = nvfp4_tensor.NVFP4QTensor.quantize(
        weights_tensor,
        16,
@@ -205,11 +206,13 @@
    )[0]._quantized_data

    block_scale_fp8 = get_trt_tensor(ctx, block_scale_fp8, name + "_block_scale_fp8")
    global_scale = to_torch(global_scale, None)
    global_scale = get_trt_tensor(ctx, global_scale, name + "_global_scale")
-    weights_fp4_represented_in_uint8 = get_trt_tensor(ctx, weights_tensor_scaled, name + "_weights_fp4_represented_in_uint8")
+    weights_fp4_represented_in_uint8 = get_trt_tensor(
+        ctx, weights_tensor_scaled, name + "_weights_fp4_represented_in_uint8"
+    )

    # dequantize block scale from fp8 to float32
    dequantize_block_scale_layer = ctx.net.add_dequantize(
        block_scale_fp8,
        global_scale,
@@ -248,6 +251,5 @@
    )  # amax is calculated from input_tensor.abs().amax().float()
    global_scale = torch.divide(amax, 6 * 448)
    if global_scale == 0:
        global_scale = 1.0
    return global_scale
-
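
The last hunk above divides `amax` by `6 * 448`. As a rough illustration of why that divisor is used: 6 is the largest magnitude representable in FP4 (E2M1) and 448 is the largest finite magnitude in FP8 (E4M3). The sketch below is plain Python rather than the PR's torch code. The helper names `global_scale_from_amax` and `block_scales` are hypothetical, and the per-block formula (block amax divided by 6, then by the global scale, before any rounding to FP8) is an assumption about how the two-level scaling fits together, not something confirmed by this diff.

```python
FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 (E2M1)
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 (E4M3)
BLOCK_SIZE = 16       # block size passed as the literal 16 in the diff above


def global_scale_from_amax(amax: float) -> float:
    """Mirror the diff's logic: amax / (6 * 448), falling back to 1.0 for all-zero input."""
    scale = amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)
    return scale if scale != 0 else 1.0


def block_scales(weights, global_scale, block_size=BLOCK_SIZE):
    """Hypothetical helper (assumed formula): one scale per block, before FP8 rounding."""
    scales = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        block_amax = max(abs(w) for w in block)
        scales.append(block_amax / FP4_E2M1_MAX / global_scale)
    return scales


# 32 values form two blocks of 16; each block contains the overall maximum |w| = 3.0.
weights = [0.5, -3.0, 1.5, 0.25] * 8
amax = max(abs(w) for w in weights)
gs = global_scale_from_amax(amax)
per_block = block_scales(weights, gs)
```

Under these assumptions, a block that contains the tensor-wide maximum gets a block scale of about 448, the top of the FP8 range, so no block scale can overflow FP8 once it is expressed relative to the global scale.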

lanluo-nvidia added 18 commits April 29, 2025 09:22

Add fp4 support

1d172ce

test

d2b1422

Merge branch 'main' into lluo/fp4

38617b4

upgrade modelopt

d439d96

add constant fold

5a2213e

fix the input tensor type issue

fcf0c12

test

057f35a

test

7b09862

test

d9f2ad9

test

6892a47

Merge branch 'main' into lluo/fp4

5198f9a

restructure the dynamic double quantize and static double quantize code

559ada5

add test code

bba1d79

test

f16e58a

test

06c8126

test

868949c

test

5134a2c

test

391c971

facebook-github-bot added the cla signed label May 15, 2025

github-actions bot requested a review from gs-olive May 15, 2025 17:28

github-actions bot requested changes May 15, 2025

test

38297bd

github-actions bot requested changes May 15, 2025

add print graph

095251f

github-actions bot requested changes May 15, 2025

test float16

5830211
