[JAX] Add collective GEMM without compute/communication overlap #1675
base: main
Conversation
Signed-off-by: Philipp Hack <[email protected]>
LGTM, pending very minor docstring fix.
    use_split_accumulator,
):
    """
    Fused attention fwd lowering rules
Looks like a leftover incorrect docstring from the copied primitive template.
def _jax_cast_fp8(inputs, scale, amax, out_dtype):
    """
    JAX native fp8 casting implementation
    """
    casted_output = _jax_quantize(inputs, scale, dq_dtype=out_dtype)
    updated_amax = jax.lax.max(amax, jnp.max(jnp.abs(inputs)).astype(amax.dtype))
    return casted_output, updated_amax
Please use _jax_quantize() instead.
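A minimal sketch of what the call site could look like after this suggestion (not code from the PR; it reuses the names inputs, scale, amax, and out_dtype from the snippet above and assumes _jax_quantize is in scope):

import jax
import jax.numpy as jnp

# Hedged sketch: call the existing _jax_quantize() helper directly and keep the
# amax update at the call site, instead of going through a _jax_cast_fp8() wrapper.
casted_output = _jax_quantize(inputs, scale, dq_dtype=out_dtype)
updated_amax = jax.lax.max(amax, jnp.max(jnp.abs(inputs)).astype(amax.dtype))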
_load_library()
if module_name not in sys.modules:
    _load_library()
Hi,
Any reasons for these changes?
@@ -101,7 +103,6 @@ def _load_library():
)

__all__ = [
    "fp8_autocast",
I think we do need to export fp8_autocast.
if scaling_mode == ScalingMode.NVTE_DELAYED_TENSOR_SCALING:
    lhs_scale_inv = lhs.scale_inv.reshape(-1)
    rhs_scale_inv = rhs.scale_inv.reshape(-1)
if scaling_mode == ScalingMode.NVTE_MXFP8_1D_SCALING:
    lhs_scale_inv = lhs_scale_inv.reshape(-1)
    rhs_scale_inv = rhs_scale_inv.reshape(-1)
Hi,
- Why do we need to reshape the scale_inv for DelayedScaling?
- For MXFP8, don't we need to call swizzle_scale?
Result_Type out_amax_updated, Result_Type out_scale_updated,
Result_Type pre_gelu_out, Result_Type bias_grad, Result_Type workspace,
bool lhs_trans, bool rhs_trans, bool fuse_gelu, bool fuse_bias, bool grad,
bool accumulate, bool use_split_accumulator);
Don't you need to bind the scaling_mode?
auto workspace_ = TensorWrapper(workspace, std::vector<size_t>{workspace_size}, DType::kByte);

// cuBLAS is column-major, so we swap LHS and RHS in the arguments
auto num_math_sm = cuda::sm_count() - getenv<int>("NVTE_EXT_MARGIN_SM", 0);
I think cuda::sm_count() involves initializing a new cudaDeviceProp, which may break cudaGraph. Please query the sm_count via the handler instead, as:
auto num_sm = cudaDevicePropertiesManager::Instance().GetMultiProcessorCount() - _sm_margin;
@@ -59,6 +59,7 @@ pybind11::dict Registrations() {
      pybind11::dict(pybind11::arg("prepare") = EncapsulateFFI(CublasHandleInitHandler),
                     pybind11::arg("execute") = EncapsulateFFI(GroupedGemmHandler));

  dict["te_gemm_ffi"] = EncapsulateFFI(GemmHandler);
Please add the prepare phase as in te_grouped_gemm_ffi.
Description
Rebase of #1307:
Implements the XLA custom calls in C++ and the corresponding JAX primitive, including custom partitioning rules.
Custom partitioning rules for an LHS:([B,] M, K) x RHS:([B,] K, N) = OUT:([B,] M, N) batched mat-mul operation, where [B] is the batch dimension (a short sketch of the resulting output collectives follows the list):
Preserve the partitioning of the [B] dimension for all operands.
Always all-gather LHS along the M dimension.
Error out if RHS is partitioned in both K and N dimensions.
Force the K dimension of LHS to match the partitioning of the K dimension of RHS.
If the K dimension is partitioned but the M dimension is not, jax.lax.psum (all-reduce) the output over the TP mesh resource.
If both the M and K dimensions are partitioned, jax.lax.psum_scatter (reduce-scatter) the output over the TP mesh resource.
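A minimal JAX sketch of the last two rules (not the PR's actual partitioning implementation; out, k_is_partitioned, m_is_partitioned, and tp_resource are assumed names, and the helper must run inside shard_map/pmap with tp_resource as a mesh axis):

import jax

def _finalize_output(out, k_is_partitioned, m_is_partitioned, tp_resource):
    # Hypothetical helper illustrating the output collectives described above.
    if k_is_partitioned and m_is_partitioned:
        # Both M and K sharded: reduce-scatter the partial sums along the TP axis.
        return jax.lax.psum_scatter(out, axis_name=tp_resource, tiled=True)
    if k_is_partitioned:
        # K sharded, M replicated: all-reduce the partial sums along the TP axis.
        return jax.lax.psum(out, axis_name=tp_resource)
    return out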
In practice, the RHS matrix (typically the weight tensor) should be allocated with transposed contracting dimensions ([B,] N, K) for optimal GEMM heuristics in cuBLASLt. This layout is also mandatory for FP8 inputs.
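For reference, a small standalone example (shapes and names invented here, not taken from the PR) of contracting against a weight stored in the transposed (N, K) layout:

import jax
import jax.numpy as jnp

M, K, N = 128, 256, 512
lhs = jnp.zeros((M, K), dtype=jnp.bfloat16)  # activations: (M, K)
rhs = jnp.zeros((N, K), dtype=jnp.bfloat16)  # weight allocated as (N, K)

# Contract the K dimension of both operands (LHS dim 1 with RHS dim 1); no batch dims.
out = jax.lax.dot_general(lhs, rhs, dimension_numbers=(((1,), (1,)), ((), ())))
assert out.shape == (M, N)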
This PR does NOT update fused ops or Flax/Praxis modules to use the new GEMM custom op over the existing XLA pattern matching approach.
Changes
Please list the changes introduced in this PR:
Added JAX primitive for the XLA custom call.
Added serial unit test.
Added distributed unit test.