
Performance regression in version 0.3.28 on aarch64 because of GEMM to GEMV transformation #4951

Closed
cdaley opened this issue Oct 23, 2024 · 7 comments · Fixed by #4955

cdaley commented Oct 23, 2024

We have found a DGEMM performance regression in OpenBLAS-0.3.28 on aarch64 platforms. It happens because of the GEMM to GEMV forwarding that was introduced and enabled on aarch64 in this version of OpenBLAS. Here is the difference in performance in GFLOP/s for a single-threaded dgemm('N','T',1,91,90,1,a,50,b,91,0,c,50) with and without GEMM to GEMV forwarding:

| Processor | With forwarding (default) | Without forwarding |
|---|---|---|
| Ampere Altra | 0.77 | 4.05 |
| AWS Graviton3 | 0.72 | 7.64 |
| NVIDIA Grace | 0.92 | 12.21 |

The results show that the forwarding can cause an order of magnitude performance loss. A perf profile shows that the time is spent in the scalar code at https://github.com/OpenMathLib/OpenBLAS/blob/v0.3.28/kernel/arm64/gemv_n.S#L295.

The GEMM to GEMV forwarding performs better after copying KERNEL.A64FX to KERNEL.NEOVERSEV1 and KERNEL.NEOVERSEV2, but it is still below the performance without GEMM to GEMV forwarding:

| Processor | GFLOP/s with KERNEL.A64FX |
|---|---|
| AWS Graviton3 | 3.11 |
| NVIDIA Grace | 4.72 |

A perf profile shows that the time is spent in the scalar code at: https://github.com/OpenMathLib/OpenBLAS/blob/v0.3.28/kernel/arm64/gemv_n_sve.c#L81.

It seems beneficial either to disable the GEMM to GEMV forwarding or to restrict its usage as follows:

```diff
diff -ruN OpenBLAS-0.3.28-original/interface/gemm.c OpenBLAS-0.3.28/interface/gemm.c
--- OpenBLAS-0.3.28-original/interface/gemm.c   2024-08-08 13:41:46.000000000 -0700
+++ OpenBLAS-0.3.28/interface/gemm.c    2024-10-22 07:59:31.538919000 -0700
@@ -521,7 +521,7 @@
       GEMV(&NT, &m, &n, args.alpha, args.a, &lda, args.b, &inc_x, args.beta, args.c, &inc_y);
       return;
     }
-    if (args.m == 1) {
+    if (args.m == 1 && args.ldc == 1) {
       blasint inc_x = args.lda;
       blasint inc_y = args.ldc;
       // These were passed in as blasint, but the struct translates them to blaslong
```

https://github.com/OpenMathLib/OpenBLAS/blob/v0.3.28/interface/gemm.c#L524.

This appears to be different from issue #4939, which has DGEMM calls with m != 1.


Mousius commented Oct 23, 2024

Thanks for looking into this in such detail, @cdaley. The extra guard for ldc makes sense to me. Should it also be applied to args.n==1?


cdaley commented Oct 24, 2024

Hi @Mousius. Yes, it should also be applied to args.n == 1.

My old code was not quite right as it prevented some beneficial forwarding cases. I think the new code should be something like the following:

```diff
diff -ruN OpenBLAS-0.3.28-original/interface/gemm.c OpenBLAS-0.3.28/interface/gemm.c
--- OpenBLAS-0.3.28-original/interface/gemm.c   2024-08-08 13:41:46.000000000 -0700
+++ OpenBLAS-0.3.28/interface/gemm.c    2024-10-23 16:35:08.936908000 -0700
@@ -39,6 +39,7 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#include <stdbool.h>
 #include "common.h"
 #ifdef FUNCTION_PROFILE
 #include "functable.h"
@@ -518,6 +519,8 @@
       if (transb & 1) {
         inc_x = args.ldb;
       }
+      bool is_efficient_gemv = (NT == 'T' && inc_x == 1) || (NT == 'N');
+      if (!is_efficient_gemv) goto execute_gemm;
       GEMV(&NT, &m, &n, args.alpha, args.a, &lda, args.b, &inc_x, args.beta, args.c, &inc_y);
       return;
     }
@@ -538,10 +541,15 @@
         m = args.n;
         n = args.k;
       }
+      // The following is needed to avoid significant gemv performance penalties
+      // on arm64, see https://github.com/OpenMathLib/OpenBLAS/issues/4951
+      bool is_efficient_gemv = (NT == 'T' && inc_x == 1) || (NT == 'N' && inc_y == 1);
+      if (!is_efficient_gemv) goto execute_gemm;
       GEMV(&NT, &m, &n, args.alpha, args.b, &ldb, args.a, &inc_x, args.beta, args.c, &inc_y);
       return;
     }
   }
+execute_gemm:
 #endif
 
   IDEBUG_START;
```
The intent of this code is to avoid the scalar code in gemv_n_sve.c and gemv_t_sve.c.

This workaround is only helpful because the arm64 gemv kernels are less tuned than the x86_64 gemv kernels.


Mousius commented Oct 24, 2024

Looks like we're getting close to a fix, can you raise this as a Pull Request @cdaley? We can continue reviewing it there 😸

@martin-frbg

Thanks for the update - having it as a PR would be nice indeed, if you can spare the time. (I wonder if the if-goto could be `#if defined(ARM64)` - as long as we think it affects only that one platform. And maybe inverted to be just another conditional for invoking GEMV - but that may be an irrational fear of "goto" :) )


Mousius commented Oct 24, 2024

> Thanks for the update - having it as a PR would be nice indeed, if you can spare the time. (I wonder if the if-goto could be `#if defined(ARM64)` - as long as we think it affects only that one platform. And maybe inverted to be just another conditional for invoking GEMV - but that may be an irrational fear of "goto" :) )

Fear of goto is entirely rational 😸

@martin-frbg

maybe even fear of Goto - what have we done to his work


cdaley commented Oct 25, 2024

PR created at #4955
