
Vectorized matmul performance regression - function inlining #883

Open
jtuyls opened this issue Nov 7, 2024 · 5 comments

jtuyls (Collaborator) commented Nov 7, 2024

We're seeing a performance regression on vectorized matmul, likely caused by the following PR: #856. See the table below:

- Matmul problem size: 512x512x4096 (MxKxN)
- Array configuration: 2x2
- Kernel type (vectorization / ukernel / scalar): Vectorization

| Commit  | Latency (us) |
|---------|--------------|
| 12f0502 | 48521        |
| 2086718 | 42513        |

@Abhishek-Varma

Note that there is another PR causing a performance regression, #882, which is likely orthogonal.

newling (Contributor) commented Nov 7, 2024

Can someone please remind me -- at what granularity is the matmul outlined? Is it at the m=n=k=64 granularity or the m=n=4, k=8 granularity (assuming phoenix bf16)?

Abhishek-Varma (Contributor) commented:

> Can someone please remind me -- at what granularity is the matmul outlined? Is it at the m=n=k=64 granularity or the m=n=4, k=8 granularity (assuming phoenix bf16)?

It takes place at the latter granularity.

Here's the outlined matmul:

```mlir
func.func private @generic_matmul_outlined(%arg0: memref<1x1x4x4x4x8xbf16, 2 : i32>, %arg1: memref<1x1x4x4x8x4xbf16, 2 : i32>, %arg2: memref<1x1x4x4x4x4xf32, 2 : i32>) {
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : memref<1x1x4x4x4x8xbf16, 2 : i32>, memref<1x1x4x4x8x4xbf16, 2 : i32>) outs(%arg2 : memref<1x1x4x4x4x4xf32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[32, 32], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [16, 16, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: bf16, %in_0: bf16, %out: f32):
    %0 = arith.extf %in : bf16 to f32
    %1 = arith.extf %in_0 : bf16 to f32
    %2 = arith.mulf %0, %1 : f32
    %3 = arith.addf %out, %2 : f32
    linalg.yield %3 : f32
  }
  return
}
```

Here's an e2e log (created earlier) for reference.
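
For reference, a minimal sketch of the call site that outlining produces (the caller name is illustrative and the enclosing loop nest is elided, as it isn't shown in this issue; only the callee signature above is taken from the log):

```mlir
// Hypothetical call site for the outlined function above. In the real IR
// this call sits inside a loop nest, so the function-invocation overhead
// is paid on every iteration.
func.func @matmul_tile_caller(%a: memref<1x1x4x4x4x8xbf16, 2 : i32>,
                              %b: memref<1x1x4x4x8x4xbf16, 2 : i32>,
                              %c: memref<1x1x4x4x4x4xf32, 2 : i32>) {
  func.call @generic_matmul_outlined(%a, %b, %c)
      : (memref<1x1x4x4x4x8xbf16, 2 : i32>,
         memref<1x1x4x4x8x4xbf16, 2 : i32>,
         memref<1x1x4x4x4x4xf32, 2 : i32>) -> ()
  return
}
```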

Abhishek-Varma (Contributor) commented:

Not sure, but perhaps this might be the reason behind (and hopefully a fix for) this regression.

newling (Contributor) commented Nov 8, 2024

Maybe, but it isn't surprising to me that outlining a single AIE instruction (matmul on 4x8x4) can result in a slowdown.

Abhishek-Varma (Contributor) commented:

> Maybe, but it isn't surprising to me that outlining a single AIE instruction (matmul on 4x8x4) can result in a slowdown.

Yeah, I guess outlining functions would add some regression because of the function invocation overhead.

Since outlining was initially introduced to reduce the program memory requirement, it can definitely introduce performance overhead. Perhaps the way forward for now is "conditional" enabling of function outlining, while the peano loop unrolling control is enabled?
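
To make the tradeoff concrete, here is a minimal standalone before/after sketch of function outlining (not this repo's actual pass output; the op, shapes, and names are all illustrative):

```mlir
// Before outlining: the computation body sits inline at every use site,
// so each core's program memory holds its own copy of the generated code.
func.func @before(%a: memref<16xf32>, %b: memref<16xf32>) {
  linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>],
                  iterator_types = ["parallel"]}
      ins(%a : memref<16xf32>) outs(%b : memref<16xf32>) {
  ^bb0(%in: f32, %out: f32):
    %0 = arith.addf %in, %out : f32
    linalg.yield %0 : f32
  }
  return
}

// After outlining: a single shared copy of the body shrinks program
// memory, but every use now pays call overhead, and the callee is opaque
// to later transformations (e.g. loop unrolling in peano) unless it is
// re-inlined.
func.func private @body_outlined(%a: memref<16xf32>, %b: memref<16xf32>) {
  linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>],
                  iterator_types = ["parallel"]}
      ins(%a : memref<16xf32>) outs(%b : memref<16xf32>) {
  ^bb0(%in: f32, %out: f32):
    %0 = arith.addf %in, %out : f32
    linalg.yield %0 : f32
  }
  return
}

func.func @after(%a: memref<16xf32>, %b: memref<16xf32>) {
  func.call @body_outlined(%a, %b) : (memref<16xf32>, memref<16xf32>) -> ()
  return
}
```

A "conditional" policy along these lines would emit the outlined form only when program memory pressure requires it, and keep the inline form otherwise.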
