Add volk_64f_x2_dot_prod_64f #627

Open
BatchDrake

This is basically the 64-bit version of volk_32f_x2_dot_prod_32f. Since this is my first PR to Volk and I will probably be writing a few more kernels for batched 64-bit 3D plane/rect intersections at some point, all stylistic/performance feedback is more than welcome.

This is the result of the test:

test 121
    Start 121: qa_volk_64f_x2_dot_prod_64f

121: Test command: /usr/bin/sh "/home/waldo/Documents/Development/volk/build/lib/volk_64f_x2_dot_prod_64f_test.sh" "/home/waldo/Documents/Development/volk/build/lib"
121: Test timeout computed to be: 10000000
121: RUN_VOLK_TESTS: volk_64f_x2_dot_prod_64f(131071,1)
121: generic completed in 0.210324 ms
121: u_sse completed in 0.212984 ms
121: u_sse3 completed in 0.218968 ms
121: u_sse4_1 completed in 0.210118 ms
121: u_avx completed in 0.141696 ms
121: u_avx2_fma completed in 0.178188 ms
121: a_generic completed in 0.375334 ms
121: a_sse completed in 0.290898 ms
121: a_sse3 completed in 0.38351 ms
121: a_sse4_1 completed in 0.508197 ms
121: a_avx completed in 0.493607 ms
121: a_avx2_fma completed in 0.529337 ms
121: Best aligned arch: u_avx
121: Best unaligned arch: u_avx
1/1 Test #121: qa_volk_64f_x2_dot_prod_64f ......   Passed    0.06 sec

The following tests passed:
        qa_volk_64f_x2_dot_prod_64f

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   0.07 sec

PS: While I expected some improvement (a 33% increase), I see that most of the kernels perform worse than the generic one. I don't know whether it makes sense to keep the worst ones.

Signed-off-by: Gonzalo J. Carracedo Carballal <[email protected]>

jdemel commented Aug 11, 2023

@BatchDrake thanks for this PR! I'll look into it.

jdemel left a comment

Again, thanks for your PR. I hope I can add some useful hints.

#ifdef LV_HAVE_GENERIC


static inline void volk_64f_x2_dot_prod_64f_a_generic(double* result,

I suggest removing this function. Older kernels sometimes have an aligned generic version, but the generic kernel should not rely on any alignment. Also, this kernel yields wildly differing results compared to the "unaligned" one.
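For illustration, a minimal sketch of what a single, alignment-agnostic generic kernel could look like. The function name `dot_prod_64f_generic_sketch` and the parameter names `input`, `taps`, and `num_points` are assumptions modelled on the 32-bit kernel; only the leading `double* result` parameter is visible in the diff above.

```c
/* Hypothetical sketch, not the code from this PR.
 * Plain scalar loads work for any pointer that is valid for double,
 * so one generic kernel can serve both aligned and unaligned callers. */
static inline void dot_prod_64f_generic_sketch(double* result,
                                               const double* input,
                                               const double* taps,
                                               unsigned int num_points)
{
    double acc = 0.0;
    for (unsigned int i = 0; i < num_points; i++) {
        acc += input[i] * taps[i]; /* accumulate one product per point */
    }
    *result = acc;
}
```

Because nothing in the loop assumes 16- or 32-byte alignment, it can be called with a pointer offset into a buffer (for example `input + 1`) just as well as with the start of an aligned allocation.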

Comment on lines +507 to +512
for (; number < eighthPoints; number++) {

a0Val = _mm_load_pd(aPtr);
a1Val = _mm_load_pd(aPtr + 2);
a2Val = _mm_load_pd(aPtr + 4);
a3Val = _mm_load_pd(aPtr + 6);

This might actually be a source of the slow results. Compilers are incredibly smart nowadays, and this kind of manual "loop unrolling" might actually block some compiler optimizations.
You might want to start with godbolt.com and inspect the output for the generic kernel when you compile it for a specific SIMD extension. I'm aware that it might not be trivial to find the optimized assembly code in the output. Still, it is a possible starting point.
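As a possible starting point for that experiment, here is a small stand-alone translation unit that could be pasted into Compiler Explorer. The name `dot_prod_64f_plain` is hypothetical and not part of VOLK. The function has external linkage on purpose: an unused `static inline` function usually produces no assembly at all.

```c
#include <stddef.h>

/* Hypothetical stand-alone version of the plain loop, for inspection only.
 * restrict tells the compiler the two inputs do not alias each other. */
double dot_prod_64f_plain(const double* restrict input,
                          const double* restrict taps,
                          size_t num_points)
{
    double acc = 0.0;
    for (size_t i = 0; i < num_points; i++) {
        acc += input[i] * taps[i];
    }
    return acc;
}
```

With GCC or Clang, flags such as `-O3 -mavx2 -mfma` select the target extension. Note that a floating-point sum is order-sensitive, so the compiler will typically only turn this reduction into packed SIMD adds when reassociation is allowed (for example with `-ffast-math`), and that reordering can change the last bits of the result. Comparing the output with and without that flag against the hand-unrolled intrinsics should show how much the manual unrolling actually buys.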
