Performance regression in version 0.3.28 on aarch64 because of GEMM to GEMV transformation #4951
Comments
Thanks for looking into this in such detail, @cdaley. The extra guard for |
Hi @Mousius. Yes it should also get applied to args.n == 1. My old code was not quite right as it prevented some beneficial forwarding cases. I think the new code should be something like the following:
The intent of this code is to avoid the scalar code in gemv_n_sve.c and gemv_t_sve.c. This workaround is only helpful because the arm64 gemv kernels are less tuned than the x86_64 gemv kernels.
Looks like we're getting close to a fix, can you raise this as a Pull Request @cdaley? We can continue reviewing it there 😸
Thanks for the update - having it as a PR would be nice indeed, if you can spare the time. (I wonder if the if-goto could be `#if defined(ARM64)` - as long as we think it affects only that one platform. And maybe inverted to be just another conditional for invoking GEMV - but that may be an irrational fear of "goto" :) )
Fear of |
maybe even fear of Goto - what have we done to his work
PR created at #4955
We have found a DGEMM performance regression in OpenBLAS-0.3.28 on aarch64 platforms. It happens because of the GEMM to GEMV forwarding that was introduced and enabled on aarch64 in this version of OpenBLAS. Here is the difference in performance in GFLOP/s for a single-threaded dgemm('N','T',1,91,90,1,a,50,b,91,0,c,50) with and without GEMM to GEMV forwarding:
The results show that the forwarding can cause an order of magnitude performance loss. A perf profile shows that the time is spent in the scalar code at https://github.com/OpenMathLib/OpenBLAS/blob/v0.3.28/kernel/arm64/gemv_n.S#L295.
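For readers unfamiliar with the forwarding, here is a minimal sketch of how the dgemm call above maps onto a GEMV when m == 1. It uses plain reference loops, not OpenBLAS code; the helper names `ref_dgemm_nt`, `ref_dgemv_n` and `forwarding_max_diff`, and the fill values, are made up for illustration. With column-major storage, the single row of A becomes the GEMV input vector with stride lda, and the single row of C becomes the output vector with stride ldc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Reference C := alpha*A*B^T + beta*C, column-major, 'N','T' case only. */
static void ref_dgemm_nt(int m, int n, int k, double alpha,
                         const double *a, int lda,
                         const double *b, int ldb,
                         double beta, double *c, int ldc)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double acc = 0.0;
            for (int l = 0; l < k; l++)
                acc += a[i + (size_t)l * lda] * b[j + (size_t)l * ldb];
            double *cij = &c[i + (size_t)j * ldc];
            *cij = alpha * acc + (beta == 0.0 ? 0.0 : beta * *cij);
        }
    }
}

/* Reference y := alpha*A*x + beta*y, A is rows x cols, column-major. */
static void ref_dgemv_n(int rows, int cols, double alpha,
                        const double *a, int lda,
                        const double *x, int incx,
                        double beta, double *y, int incy)
{
    for (int i = 0; i < rows; i++) {
        double acc = 0.0;
        for (int l = 0; l < cols; l++)
            acc += a[i + (size_t)l * lda] * x[(size_t)l * incx];
        double *yi = &y[(size_t)i * incy];
        *yi = alpha * acc + (beta == 0.0 ? 0.0 : beta * *yi);
    }
}

/* Runs dgemm('N','T',1,91,90,...) and the equivalent GEMV; returns the
   largest absolute difference between the two results. */
double forwarding_max_diff(void)
{
    enum { N = 91, K = 90, LDA = 50, LDB = 91, LDC = 50 };
    double *a  = calloc((size_t)LDA * K, sizeof *a);
    double *b  = calloc((size_t)LDB * K, sizeof *b);
    double *c1 = calloc((size_t)LDC * N, sizeof *c1);
    double *c2 = calloc((size_t)LDC * N, sizeof *c2);
    assert(a && b && c1 && c2);

    /* Arbitrary fill values (illustrative only). */
    for (int l = 0; l < K; l++) {
        a[(size_t)l * LDA] = (double)(l % 7) / 8.0;
        for (int j = 0; j < N; j++)
            b[j + (size_t)l * LDB] = (double)((j + 3 * l) % 11) / 16.0;
    }

    /* GEMM with m == 1 ... */
    ref_dgemm_nt(1, N, K, 1.0, a, LDA, b, LDB, 0.0, c1, LDC);
    /* ... is a GEMV on B with x = row of A (incx = lda), y = row of C (incy = ldc). */
    ref_dgemv_n(N, K, 1.0, b, LDB, a, LDA, 0.0, c2, LDC);

    double max_diff = 0.0;
    for (int j = 0; j < N; j++) {
        double d = c1[(size_t)j * LDC] - c2[(size_t)j * LDC];
        if (d < 0.0) d = -d;
        if (d > max_diff) max_diff = d;
    }
    free(a); free(b); free(c1); free(c2);
    return max_diff;
}
```

Note that in this mapping both vector strides are 50 rather than 1. If the aarch64 GEMV kernels only vectorise the unit-stride case, that would be consistent with the profile above, but that link is an assumption here rather than something measured in this sketch.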
The GEMM to GEMV forwarding performs better after copying KERNEL.A64FX to KERNEL.NEOVERSEV1 and KERNEL.NEOVERSEV2, but it is still below the performance without GEMM to GEMV forwarding:
A perf profile shows that the time is spent in the scalar code at: https://github.com/OpenMathLib/OpenBLAS/blob/v0.3.28/kernel/arm64/gemv_n_sve.c#L81.
It seems to be beneficial to either disable the GEMM to GEMV forwarding or restrict its usage as follows:
https://github.com/OpenMathLib/OpenBLAS/blob/v0.3.28/interface/gemm.c#L524.
This appears to be different to issue #4939, which has DGEMM calls with m != 1.