
Vectorized matmul performance regression - function inlining #883

Open
jtuyls opened this issue Nov 7, 2024 · 5 comments

jtuyls (Collaborator) commented Nov 7, 2024

We're seeing a performance regression on vectorized matmul, likely caused by the following PR: #856. See the table below:

- Matmul problem size: 512x512x4096 (MxKxN)
- Array configuration: 2x2
- Kernel type (vectorization / ukernel / scalar): Vectorization

| Commit  | Latency (us) |
|---------|--------------|
| 12f0502 | 48521        |
| 2086718 | 42513        |

@Abhishek-Varma

Note that there is another PR causing a performance regression, #882, which is likely orthogonal.

newling (Contributor) commented Nov 7, 2024

Can someone please remind me -- at what granularity is the matmul outlined? Is it at the m=n=k=64 granularity or the m=n=4, k=8 granularity (assuming phoenix bf16)?

Abhishek-Varma (Contributor) commented:

> Can someone please remind me -- at what granularity is the matmul outlined? Is it at the m=n=k=64 granularity or the m=n=4, k=8 granularity (assuming phoenix bf16)?

It takes place at the latter granularity.

Here's the outlined matmul:

```mlir
func.func private @generic_matmul_outlined(%arg0: memref<1x1x4x4x4x8xbf16, 2 : i32>, %arg1: memref<1x1x4x4x8x4xbf16, 2 : i32>, %arg2: memref<1x1x4x4x4x4xf32, 2 : i32>) {
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : memref<1x1x4x4x4x8xbf16, 2 : i32>, memref<1x1x4x4x8x4xbf16, 2 : i32>) outs(%arg2 : memref<1x1x4x4x4x4xf32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[32, 32], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [16, 16, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: bf16, %in_0: bf16, %out: f32):
    %0 = arith.extf %in : bf16 to f32
    %1 = arith.extf %in_0 : bf16 to f32
    %2 = arith.mulf %0, %1 : f32
    %3 = arith.addf %out, %2 : f32
    linalg.yield %3 : f32
  }
  return
}
```

Here's an e2e log (created earlier) for reference.
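
For reference, a minimal sketch of the call site that outlining produces (the caller name is illustrative and the enclosing loop nest is elided, as it isn't shown in this issue; only the callee signature above is taken from the log):

```mlir
// Hypothetical call site for the outlined function above. In the real IR
// this call sits inside a loop nest, so the function-invocation overhead
// is paid on every iteration.
func.func @matmul_tile_caller(%a: memref<1x1x4x4x4x8xbf16, 2 : i32>,
                              %b: memref<1x1x4x4x8x4xbf16, 2 : i32>,
                              %c: memref<1x1x4x4x4x4xf32, 2 : i32>) {
  func.call @generic_matmul_outlined(%a, %b, %c)
      : (memref<1x1x4x4x4x8xbf16, 2 : i32>,
         memref<1x1x4x4x8x4xbf16, 2 : i32>,
         memref<1x1x4x4x4x4xf32, 2 : i32>) -> ()
  return
}
```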

Abhishek-Varma (Contributor) commented:

Not sure, but perhaps this might be the reason behind (and hopefully a fix for) this regression.

newling (Contributor) commented Nov 8, 2024

Maybe, but it isn't surprising to me that outlining a single AIE instruction (matmul on 4x8x4) can result in a slowdown.

Abhishek-Varma (Contributor) commented:

> Maybe, but it isn't surprising to me that outlining a single AIE instruction (matmul on 4x8x4) can result in a slowdown.

Yeah, I guess outlining functions would add some regression because of the function invocation overhead.

Since outlining was initially introduced to reduce the program memory requirement, it can definitely introduce performance overhead. Perhaps the way forward for now is "conditional" enabling of function outlining, while the peano loop unrolling control is enabled?
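
To make the tradeoff concrete, here is a minimal standalone before/after sketch of function outlining (not this repo's actual pass output; the op, shapes, and names are all illustrative):

```mlir
// Before outlining: the computation body sits inline at every use site,
// so each core's program memory holds its own copy of the generated code.
func.func @before(%a: memref<16xf32>, %b: memref<16xf32>) {
  linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>],
                  iterator_types = ["parallel"]}
      ins(%a : memref<16xf32>) outs(%b : memref<16xf32>) {
  ^bb0(%in: f32, %out: f32):
    %0 = arith.addf %in, %out : f32
    linalg.yield %0 : f32
  }
  return
}

// After outlining: a single shared copy of the body shrinks program
// memory, but every use now pays call overhead, and the callee is opaque
// to later transformations (e.g. loop unrolling in peano) unless it is
// re-inlined.
func.func private @body_outlined(%a: memref<16xf32>, %b: memref<16xf32>) {
  linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>],
                  iterator_types = ["parallel"]}
      ins(%a : memref<16xf32>) outs(%b : memref<16xf32>) {
  ^bb0(%in: f32, %out: f32):
    %0 = arith.addf %in, %out : f32
    linalg.yield %0 : f32
  }
  return
}

func.func @after(%a: memref<16xf32>, %b: memref<16xf32>) {
  func.call @body_outlined(%a, %b) : (memref<16xf32>, memref<16xf32>) -> ()
  return
}
```

A "conditional" policy along these lines would emit the outlined form only when program memory pressure requires it, and keep the inline form otherwise.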
