
[DispatchCreation] Modify the generated fused op to not use concats. #19980

Conversation

MaheshRavishankar
Contributor

This is an almost complete rewrite of the pass that fuses contractions horizontally. Instead of concatenating operands to map the fused computation onto a single GEMM and then extracting the individual matmul results with slices, the pass now creates a new operation whose operands are the common LHS, the RHS of each GEMM, and the output of each GEMM. The generated op yields the result of each constituent matmul.
This also allows the RHS/output indexing maps of the GEMMs to be mismatched, since only the LHS operand and its indexing map need to match. The change also permutes the iteration space of the GEMMs to ensure that the same LHS indexing map is used across all the fused matmuls.

The rest of the compiler stack has already been fixed up to handle such operations.
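
A rough sketch of the difference (the op spelling and shapes here are illustrative placeholders, not the exact IR the pass produces):

  // Before: RHS operands concatenated into one GEMM, results sliced back out.
  %rhs  = tensor.concat ... %rhs0, %rhs1 ...
  %gemm = linalg.matmul ins(%lhs, %rhs ...) outs(%out ...)
  %r0   = tensor.extract_slice %gemm ...
  %r1   = tensor.extract_slice %gemm ...

  // After: one fused op over the shared LHS that yields each matmul result
  // directly, with per-matmul RHS and output operands.
  %r0, %r1 = <fused contraction>(%lhs, %rhs0, %rhs1, %out0, %out1)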

@MaheshRavishankar
Contributor Author

It might be better to just review the new changes by themselves and ignore the diff. The pass is essentially rewritten.

@IanWood1
Contributor

There's a problem with how ops are grouped: allOps never gets updated with the other ops determined to be fusible with the root op. Also, the candidates to fuse need to be iterated over in dominance order so that, in the example below, %3 gets grouped before %4.

util.func public @test_partial_horizontal_fuse(%arg0: tensor<640x640xf32>, %arg1: tensor<640x640xf32>, %arg2: tensor<640x640xf32>, %arg3: tensor<640x640xf32>) -> (tensor<640x640xf32>, tensor<640x640xf32>, tensor<640x640xf32>) {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = tensor.empty() : tensor<640x640xf32>
  %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<640x640xf32>) -> tensor<640x640xf32>
  %2 = linalg.matmul ins(%arg0, %arg1 : tensor<640x640xf32>, tensor<640x640xf32>) outs(%1 : tensor<640x640xf32>) -> tensor<640x640xf32>
  %3 = linalg.matmul ins(%arg0, %arg2 : tensor<640x640xf32>, tensor<640x640xf32>) outs(%1 : tensor<640x640xf32>) -> tensor<640x640xf32>
  %4 = linalg.matmul ins(%arg0, %3 : tensor<640x640xf32>, tensor<640x640xf32>) outs(%1 : tensor<640x640xf32>) -> tensor<640x640xf32>
  util.return %2, %3, %4 : tensor<640x640xf32>, tensor<640x640xf32>, tensor<640x640xf32>
}

This isn't directly related to the changes you made; I think this problem exists on main too.
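
Sketched informally, the grouping one would expect for this example (illustrative only, not actual pass output): %2 and %3 share %arg0 as the LHS and are independent of each other, so they can fuse horizontally; %4 consumes %3's result, so pulling it into the same group would create a cycle.

  // Illustrative grouping only:
  %fused:2 = <horizontal fusion of %2 and %3>(%arg0, %arg1, %arg2, %1, %1)
  %4 = linalg.matmul ins(%arg0, %fused#1 : ...) outs(%1 : ...) -> tensor<640x640xf32>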

@MaheshRavishankar
Contributor Author

> There's a problem with how ops are grouped: allOps never gets updated with the other ops determined to be fusible with the root op. Also, the candidates to fuse need to be iterated over in dominance order so that, in the example below, %3 gets grouped before %4.
>
> util.func public @test_partial_horizontal_fuse(%arg0: tensor<640x640xf32>, %arg1: tensor<640x640xf32>, %arg2: tensor<640x640xf32>, %arg3: tensor<640x640xf32>) -> (tensor<640x640xf32>, tensor<640x640xf32>, tensor<640x640xf32>) {
>   %cst = arith.constant 0.000000e+00 : f32
>   %0 = tensor.empty() : tensor<640x640xf32>
>   %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<640x640xf32>) -> tensor<640x640xf32>
>   %2 = linalg.matmul ins(%arg0, %arg1 : tensor<640x640xf32>, tensor<640x640xf32>) outs(%1 : tensor<640x640xf32>) -> tensor<640x640xf32>
>   %3 = linalg.matmul ins(%arg0, %arg2 : tensor<640x640xf32>, tensor<640x640xf32>) outs(%1 : tensor<640x640xf32>) -> tensor<640x640xf32>
>   %4 = linalg.matmul ins(%arg0, %3 : tensor<640x640xf32>, tensor<640x640xf32>) outs(%1 : tensor<640x640xf32>) -> tensor<640x640xf32>
>   util.return %2, %3, %4 : tensor<640x640xf32>, tensor<640x640xf32>, tensor<640x640xf32>
> }
>
> This isn't directly related to the changes you made; I think this problem exists on main too.

Good catch. Let me see if I can fix that.

Contributor

@hanhanW hanhanW left a comment


We are already missing documentation for this pass. The PR description looks good to me; can you add that documentation to Passes.td? I.e., add a description field to the pass definition.

def FuseHorizontalContractionsPass :
    InterfacePass<"iree-dispatch-creation-fuse-horizontal-contractions",
                  "mlir::FunctionOpInterface"> {
  let summary = "Fuses horizontal contraction ops without fusions";
  let dependentDialects = [
    "mlir::arith::ArithDialect",
    "mlir::tensor::TensorDialect",
  ];
  let options = [
    Option<"fusionLimit", "fusion-limit", "int",
           /*default=*/"3", "Maximum number of contractions fused into one">
  ];
  let statistics = [
    Statistic<"numFusionGroups", "num-fusion-groups", "Number of fusion groups found">,
    Statistic<"numSize2FusionGroups", "num-size-2-groups", "Number of fusion groups of size 2">,
    Statistic<"numSize3FusionGroups", "num-size-3-groups", "Number of fusion groups of size 3">
  ];
}

@MaheshRavishankar MaheshRavishankar force-pushed the shared/noconcatHorizontalFusionChanges branch from bfca1d5 to dcfa537 on February 14, 2025 19:56
@MaheshRavishankar
Contributor Author

> There's a problem with how ops are grouped: allOps never gets updated with the other ops determined to be fusible with the root op. Also, the candidates to fuse need to be iterated over in dominance order so that, in the example below, %3 gets grouped before %4.
>
> util.func public @test_partial_horizontal_fuse(%arg0: tensor<640x640xf32>, %arg1: tensor<640x640xf32>, %arg2: tensor<640x640xf32>, %arg3: tensor<640x640xf32>) -> (tensor<640x640xf32>, tensor<640x640xf32>, tensor<640x640xf32>) {
>   %cst = arith.constant 0.000000e+00 : f32
>   %0 = tensor.empty() : tensor<640x640xf32>
>   %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<640x640xf32>) -> tensor<640x640xf32>
>   %2 = linalg.matmul ins(%arg0, %arg1 : tensor<640x640xf32>, tensor<640x640xf32>) outs(%1 : tensor<640x640xf32>) -> tensor<640x640xf32>
>   %3 = linalg.matmul ins(%arg0, %arg2 : tensor<640x640xf32>, tensor<640x640xf32>) outs(%1 : tensor<640x640xf32>) -> tensor<640x640xf32>
>   %4 = linalg.matmul ins(%arg0, %3 : tensor<640x640xf32>, tensor<640x640xf32>) outs(%1 : tensor<640x640xf32>) -> tensor<640x640xf32>
>   util.return %2, %3, %4 : tensor<640x640xf32>, tensor<640x640xf32>, tensor<640x640xf32>
> }
>
> This isn't directly related to the changes you made; I think this problem exists on main too.

@IanWood1 pushed a fix for this issue.

@IanWood1
Contributor

IanWood1 commented Feb 14, 2025

I merged this into #19847 (I think these changes enable more horizontal fusion in punet) and got a few failing dispatches: https://gist.github.com/IanWood1/2ddd601970b9d0197cf01aa91346e7e8.

They are smaller sized, so I think they're going down a different pipeline.

@MaheshRavishankar
Contributor Author

> I merged this into #19847 (I think these changes enable more horizontal fusion in punet) and got a few failing dispatches: https://gist.github.com/IanWood1/2ddd601970b9d0197cf01aa91346e7e8.
>
> They are smaller sized, so I think they're going down a different pipeline.

Really? I have been trying this on punet locally and didn't see any issue there.

@MaheshRavishankar MaheshRavishankar force-pushed the shared/noconcatHorizontalFusionChanges branch from 968c63a to dd5341d on February 15, 2025 22:02
@MaheshRavishankar
Contributor Author

> I merged this into #19847 (I think these changes enable more horizontal fusion in punet) and got a few failing dispatches: https://gist.github.com/IanWood1/2ddd601970b9d0197cf01aa91346e7e8.
>
> They are smaller sized, so I think they're going down a different pipeline.

Ok, I understand what you are saying now. I think we will need the tile and fuse pipeline to handle this operation.

Contributor

@hanhanW hanhanW left a comment


Looks good, just a few nits plus a question about a check.

Contributor

@qedawkins qedawkins left a comment


Cool, LGTM % nits!

@MaheshRavishankar MaheshRavishankar force-pushed the shared/noconcatHorizontalFusionChanges branch from dd5341d to 2e943c9 on February 18, 2025 00:31
Contributor Author

@MaheshRavishankar MaheshRavishankar left a comment


Thanks for the reviews!

MaheshRavishankar and others added 4 commits February 17, 2025 18:37
The change also allows doing horizontal fusion in cases where the LHS
operand is the same, but the RHS/Outputs might be transposed.

Signed-off-by: MaheshRavishankar <[email protected]>
Signed-off-by: MaheshRavishankar <[email protected]>
… this yet.

The previous implementation of horizontal fusion missed opportunities for
horizontal fusion in SD3. Now they do get picked up, but the backend
doesn't work on these yet. Dropping the flag is a no-op for the test
since there was no horizontal fusion to start with.

Signed-off-by: MaheshRavishankar <[email protected]>
@MaheshRavishankar MaheshRavishankar force-pushed the shared/noconcatHorizontalFusionChanges branch from 2e943c9 to ea20fec on February 18, 2025 00:38
Contributor

@hanhanW hanhanW left a comment


LG, just some optional nits.

Signed-off-by: MaheshRavishankar <[email protected]>
@MaheshRavishankar MaheshRavishankar force-pushed the shared/noconcatHorizontalFusionChanges branch from ea20fec to ac77783 on February 18, 2025 21:22
@MaheshRavishankar MaheshRavishankar merged commit b85c180 into iree-org:main Feb 18, 2025
40 checks passed