
Performance regression in version 0.3.28 on aarch64 because of GEMM to GEMV transformation #4951

Closed
cdaley opened this issue Oct 23, 2024 · 7 comments · Fixed by #4955

cdaley commented Oct 23, 2024

We have found a DGEMM performance regression in OpenBLAS-0.3.28 on aarch64 platforms. It happens because of the GEMM to GEMV forwarding that was introduced and enabled on aarch64 in this version of OpenBLAS. Here is the difference in performance in GFLOP/s for a single-threaded dgemm('N','T',1,91,90,1,a,50,b,91,0,c,50) with and without GEMM to GEMV forwarding:

| Processor | With forwarding (default) | Without forwarding |
|---|---|---|
| Ampere Altra | 0.77 | 4.05 |
| AWS Graviton3 | 0.72 | 7.64 |
| NVIDIA Grace | 0.92 | 12.21 |

The results show that the forwarding can cause an order of magnitude performance loss. A perf profile shows that the time is spent in the scalar code at https://github.com/OpenMathLib/OpenBLAS/blob/v0.3.28/kernel/arm64/gemv_n.S#L295.

The GEMM to GEMV forwarding performs better after copying KERNEL.A64FX to KERNEL.NEOVERSEV1 and KERNEL.NEOVERSEV2, but it is still below the performance without GEMM to GEMV forwarding:

| Processor | GFLOP/s with KERNEL.A64FX |
|---|---|
| AWS Graviton3 | 3.11 |
| NVIDIA Grace | 4.72 |

A perf profile shows that the time is spent in the scalar code at: https://github.com/OpenMathLib/OpenBLAS/blob/v0.3.28/kernel/arm64/gemv_n_sve.c#L81.

It seems beneficial either to disable the GEMM to GEMV forwarding or to restrict its usage as follows:

```diff
diff -ruN OpenBLAS-0.3.28-original/interface/gemm.c OpenBLAS-0.3.28/interface/gemm.c
--- OpenBLAS-0.3.28-original/interface/gemm.c   2024-08-08 13:41:46.000000000 -0700
+++ OpenBLAS-0.3.28/interface/gemm.c    2024-10-22 07:59:31.538919000 -0700
@@ -521,7 +521,7 @@
       GEMV(&NT, &m, &n, args.alpha, args.a, &lda, args.b, &inc_x, args.beta, args.c, &inc_y);
       return;
     }
-    if (args.m == 1) {
+    if (args.m == 1 && args.ldc == 1) {
       blasint inc_x = args.lda;
       blasint inc_y = args.ldc;
       // These were passed in as blasint, but the struct translates them to blaslong
```

https://github.com/OpenMathLib/OpenBLAS/blob/v0.3.28/interface/gemm.c#L524.

This appears to be different from issue #4939, which has DGEMM calls with m != 1.


Mousius commented Oct 23, 2024

Thanks for looking into this in such detail, @cdaley. The extra guard for ldc makes sense to me. Should it also be applied to args.n==1?


cdaley commented Oct 24, 2024

Hi @Mousius. Yes, it should also be applied to args.n == 1.

My old code was not quite right as it prevented some beneficial forwarding cases. I think the new code should be something like the following:

```diff
diff -ruN OpenBLAS-0.3.28-original/interface/gemm.c OpenBLAS-0.3.28/interface/gemm.c
--- OpenBLAS-0.3.28-original/interface/gemm.c   2024-08-08 13:41:46.000000000 -0700
+++ OpenBLAS-0.3.28/interface/gemm.c    2024-10-23 16:35:08.936908000 -0700
@@ -39,6 +39,7 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#include <stdbool.h>
 #include "common.h"
 #ifdef FUNCTION_PROFILE
 #include "functable.h"
@@ -518,6 +519,8 @@
       if (transb & 1) {
         inc_x = args.ldb;
       }
+      bool is_efficient_gemv = (NT == 'T' && inc_x == 1) || (NT == 'N');
+      if (!is_efficient_gemv) goto execute_gemm;
       GEMV(&NT, &m, &n, args.alpha, args.a, &lda, args.b, &inc_x, args.beta, args.c, &inc_y);
       return;
     }
@@ -538,10 +541,15 @@
         m = args.n;
         n = args.k;
       }
+      // The following is needed to avoid significant gemv performance penalties
+      // on arm64, see https://github.com/OpenMathLib/OpenBLAS/issues/4951
+      bool is_efficient_gemv = (NT == 'T' && inc_x == 1) || (NT == 'N' && inc_y == 1);
+      if (!is_efficient_gemv) goto execute_gemm;
       GEMV(&NT, &m, &n, args.alpha, args.b, &ldb, args.a, &inc_x, args.beta, args.c, &inc_y);
       return;
     }
   }
+execute_gemm:
 #endif
 
   IDEBUG_START;
```
The intent of this code is to avoid the scalar code in gemv_n_sve.c and gemv_t_sve.c.

This workaround is only helpful because the arm64 gemv kernels are less tuned than the x86_64 gemv kernels.


Mousius commented Oct 24, 2024

Looks like we're getting close to a fix, can you raise this as a Pull Request @cdaley? We can continue reviewing it there 😸

@martin-frbg

Thanks for the update - having it as a PR would be nice indeed, if you can spare the time. (I wonder if the if-goto could be `#if defined(ARM64)` - as long as we think it affects only that one platform. And maybe inverted to be just another conditional for invoking GEMV - but that may be an irrational fear of "goto" :) )


Mousius commented Oct 24, 2024

> Thanks for the update - having it as a PR would be nice indeed, if you can spare the time. (I wonder if the if-goto could be `#if defined(ARM64)` - as long as we think it affects only that one platform. And maybe inverted to be just another conditional for invoking GEMV - but that may be an irrational fear of "goto" :) )

Fear of goto is entirely rational 😸

@martin-frbg

maybe even fear of Goto - what have we done to his work


cdaley commented Oct 25, 2024

PR created at #4955
