sve optimization for HNSW::MinimaxHeap::pop_min() #4699

LizYou · 2025-11-25T01:47:47Z

SVE optimization for HNSW::MinimaxHeap::pop_min()
Add prefetching of ids.data() and dis.data() to reduce memory latency
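
For readers unfamiliar with the operation being optimized, here is a minimal sketch of a linear-scan pop-min over parallel id and distance arrays. It is not the code from this PR: ToyMinimaxHeap, min_distance, the use of -1 as the empty-slot marker, and the tie-breaking rule are assumptions made for illustration. The SVE branch vectorizes only the minimum reduction and is compiled only when the compiler defines __ARM_FEATURE_SVE; otherwise a scalar loop is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical stand-in for the heap storage (not the faiss class):
// parallel arrays where ids[i] == -1 marks an empty slot.
struct ToyMinimaxHeap {
    std::vector<int32_t> ids;
    std::vector<float> dis;

    // Smallest distance among occupied slots; +inf when none is occupied.
    float min_distance() const {
        assert(ids.size() == dis.size());
        const size_t n = ids.size();
        float best = std::numeric_limits<float>::infinity();
#if defined(__ARM_FEATURE_SVE)
        const size_t lanes = svcntw(); // number of 32-bit lanes per vector
        for (size_t i = 0; i < n; i += lanes) {
            svbool_t pg = svwhilelt_b32_u64(i, n); // lanes still in range
            svint32_t vid = svld1_s32(pg, ids.data() + i);
            svbool_t occupied = svcmpne_n_s32(pg, vid, -1);
            // Horizontal minimum over occupied lanes (+inf if there are none).
            float m = svminv_f32(occupied, svld1_f32(pg, dis.data() + i));
            if (m < best) best = m;
        }
#else
        for (size_t i = 0; i < n; ++i) {
            if (ids[i] != -1 && dis[i] < best) best = dis[i];
        }
#endif
        return best;
    }

    // Removes the occupied slot with the smallest distance and returns its
    // id, or -1 if the heap is empty. Ties go to the lowest index (a choice
    // made for this sketch). NaN distances are not handled.
    int32_t pop_min(float* vmin_out = nullptr) {
        const float vmin = min_distance();
        for (size_t i = 0; i < ids.size(); ++i) {
            if (ids[i] != -1 && dis[i] == vmin) {
                if (vmin_out) *vmin_out = vmin;
                const int32_t id = ids[i];
                ids[i] = -1; // mark the slot empty
                return id;
            }
        }
        return -1;
    }
};
```

Because every call scans all slots, the cost grows with the heap capacity, which is consistent with the larger gains reported below for larger efSearch values.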

The unit test for pop_min():

$ ./faiss_test --gtest_filter=HNSW.Test_popmin*
WARNING clustering 1000 points to 40 centroids: please provide at least 1560 training points
Running main() from /home/scratch.lyou_gpu/arm/workspaces/faiss-main/build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = HNSW.Test_popmin*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from HNSW
[ RUN ] HNSW.Test_popmin
[ OK ] HNSW.Test_popmin (0 ms)
[ RUN ] HNSW.Test_popmin_identical_distances
[ OK ] HNSW.Test_popmin_identical_distances (0 ms)
[ RUN ] HNSW.Test_popmin_infinite_distances
[ OK ] HNSW.Test_popmin_infinite_distances (0 ms)
[----------] 3 tests from HNSW (0 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (0 ms total)
[ PASSED ] 3 tests.

Performance results:

Benchmark: cuVS bench https://github.com/rapidsai/cuvs/tree/main/cpp/bench/ann
Dataset: deep-96-image
Thread counts: 1 and 8
Test machine: NVIDIA Grace CPU

1 Thread

Configuration	Baseline	Optimized	Speedup	Recall
M16.efConstruction128.efSearch16	0.1647ms	0.1643ms	1.002x	0.717
M16.efConstruction128.efSearch64	0.2858ms	0.2829ms	1.010x	0.914
M16.efConstruction128.efSearch256	0.7482ms	0.7220ms	1.036x	0.982
M16.efConstruction128.efSearch1024	2.9258ms	2.6881ms	1.088x	0.996
M32.efConstruction128.efSearch16	0.1812ms	0.1802ms	1.006x	0.784
M32.efConstruction128.efSearch64	0.3297ms	0.3254ms	1.013x	0.940
M32.efConstruction128.efSearch256	0.8822ms	0.8530ms	1.034x	0.990
M32.efConstruction128.efSearch1024	3.3204ms	3.0752ms	1.080x	0.998
M32.efConstruction256.efSearch64	0.3540ms	0.3498ms	1.012x	0.954
M32.efConstruction256.efSearch256	0.9627ms	0.9392ms	1.025x	0.994

Summary (1 Thread)

Best speedup: 1.088x (M16.efConstruction128.efSearch1024)
Average speedup: ~1.031x (arithmetic mean of the ten configurations above)
Speedup range: 1.002x - 1.088x
Larger efSearch values benefit more (up to a 1.088x speedup, about 8% lower latency, at efSearch1024)
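
The Speedup column is consistent with baseline latency divided by optimized latency. A small sketch of that arithmetic follows; the helper names speedup and latency_reduction are made up for illustration.

```cpp
#include <cassert>

// Speedup as it appears in the tables: baseline latency / optimized latency.
inline double speedup(double baseline_ms, double optimized_ms) {
    assert(optimized_ms > 0.0);
    return baseline_ms / optimized_ms;
}

// Fraction of the baseline latency that the optimization removes.
inline double latency_reduction(double baseline_ms, double optimized_ms) {
    assert(baseline_ms > 0.0);
    return 1.0 - optimized_ms / baseline_ms;
}
```

For the M16.efConstruction128.efSearch1024 row, speedup(2.9258, 2.6881) is about 1.088 and latency_reduction(2.9258, 2.6881) is about 0.081.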

8 Threads

Configuration	Baseline	Optimized	Speedup	Recall
M16.efConstruction128.efSearch16	0.0856ms	0.0855ms	1.001x	0.714
M16.efConstruction128.efSearch64	0.2157ms	0.2128ms	1.014x	0.911
M16.efConstruction128.efSearch256	0.7099ms	0.6857ms	1.035x	0.982
M16.efConstruction128.efSearch1024	2.9916ms	2.7619ms	1.083x	0.997
M32.efConstruction128.efSearch16	0.1047ms	0.1045ms	1.002x	0.773
M32.efConstruction128.efSearch64	0.2664ms	0.2633ms	1.012x	0.940
M32.efConstruction128.efSearch256	0.8598ms	0.8363ms	1.028x	0.990
M32.efConstruction128.efSearch1024	3.4262ms	3.1977ms	1.071x	0.998
M32.efConstruction256.efSearch64	0.2936ms	0.2911ms	1.008x	0.955
M32.efConstruction256.efSearch256	0.9519ms	0.9282ms	1.026x	0.994

Summary (8 Threads)

Best speedup: 1.083x (M16.efConstruction128.efSearch1024)
Average speedup: ~1.028x (arithmetic mean of the ten configurations above)
Speedup range: 1.001x - 1.083x
Larger efSearch values benefit more (up to a 1.083x speedup, about 7.7% lower latency, at efSearch1024)

1. sve optimization for HNSW::MinimaxHeap::pop_min()
2. Add prefetch for ids.data() and dis.data() to reduce memory latency

Signed-off-by: Lizhen You <[email protected]>

meta-cla · 2025-11-25T01:47:52Z

Hi @LizYou!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

alexanderguzhva · 2025-11-26T17:10:52Z

faiss/impl/HNSW.cpp

+    while (i < k_size) {
+        svbool_t pg_iter = svwhilelt_b32_u64(i, k_size);
+
+        const size_t prefetch_iterations = 2;


why 2? please add a comment

LizYou · 2025-11-27

Thanks for the review! The value 2 gave the best performance in my benchmarks. The idea is to prefetch data a fixed number of iterations ahead (here, 2), so that it is fetched neither too early nor too late for the cache access. I will add a comment explaining the choice.
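
The prefetch-ahead idea can be sketched in portable C++ as follows. This is an illustration, not the PR's SVE code: blocked_min and prefetch_for_read are hypothetical names, a block of lanes elements stands in for one vector iteration, and __builtin_prefetch is a GCC/Clang builtin (other compilers get a no-op here).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// __builtin_prefetch is a GCC/Clang builtin; other compilers get a no-op.
inline void prefetch_for_read(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, /*rw=*/0, /*locality=*/3);
#else
    (void)p;
#endif
}

// Scans `dis` in blocks of `lanes` elements (one block stands in for one
// vector iteration) and prefetches the block that lies
// `prefetch_iterations` blocks ahead of the current one.
inline float blocked_min(const std::vector<float>& dis,
                         size_t lanes,
                         size_t prefetch_iterations) {
    assert(lanes > 0);
    const size_t n = dis.size();
    float best = std::numeric_limits<float>::infinity();
    for (size_t i = 0; i < n; i += lanes) {
        const size_t prefetch_idx = i + prefetch_iterations * lanes;
        if (prefetch_idx < n) {
            prefetch_for_read(dis.data() + prefetch_idx);
        }
        const size_t end = std::min(i + lanes, n);
        for (size_t j = i; j < end; ++j) {
            best = std::min(best, dis[j]);
        }
    }
    return best;
}
```

The prefetch distance changes only timing, never the result, so a value such as 2 has to be chosen by benchmarking on the target machine, as described above.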

alexanderguzhva · 2025-11-26T17:11:18Z

faiss/impl/HNSW.cpp

+
+        const size_t prefetch_iterations = 2;
+        size_t prefetch_idx = i + prefetch_iterations * lanes;
+        if (prefetch_idx < k_size) {


is this if really needed?

LizYou · 2025-11-27

The check avoids prefetching out-of-bounds addresses. Indices in the range [i, i + lanes) are safe inside the loop, but the prefetch targets i + 2 * lanes, which can run past the loop's upper bound; prefetching there would waste CPU cycles. Let me know if you still think the check should be removed.
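
To make the effect of the check concrete, the sketch below lists the indices that a guarded loop of this shape would prefetch. The function name guarded_prefetch_indices is hypothetical; the variable names follow the diff excerpt above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the indices that the guarded loop would prefetch.
inline std::vector<size_t> guarded_prefetch_indices(size_t k_size,
                                                    size_t lanes,
                                                    size_t prefetch_iterations) {
    assert(lanes > 0);
    std::vector<size_t> out;
    for (size_t i = 0; i < k_size; i += lanes) {
        const size_t prefetch_idx = i + prefetch_iterations * lanes;
        if (prefetch_idx < k_size) { // skip targets past the end of the arrays
            out.push_back(prefetch_idx);
        }
    }
    return out;
}
```

With k_size = 20, lanes = 4 and prefetch_iterations = 2, the loop runs five times but prefetches only indices 8, 12 and 16; without the check, the last two iterations would target 24 and 28, which lie beyond the arrays.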

alexanderguzhva · 2025-11-26T17:13:18Z

overall, lgtm

Commits:

6008236 · sve optimization for HNSW::MinimaxHeap::pop_min()
1. sve optimization for HNSW::MinimaxHeap::pop_min()
2. Add prefetch for ids.data() and dis.data() to reduce memory latency
Signed-off-by: Lizhen You <[email protected]>

412faa6 · Merge branch 'main' into popmin_sve_new

alexanderguzhva reviewed · 2025-11-26

ec846d6 · Address review comments
Signed-off-by: Lizhen You <[email protected]>
