@LizYou LizYou commented Nov 25, 2025

  1. SVE optimization for HNSW::MinimaxHeap::pop_min()
  2. Add prefetch for ids.data() and dis.data() to reduce memory latency
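
For orientation, here is a minimal scalar sketch of the operation being vectorized: a linear scan that removes and returns the entry with the smallest distance. It is not the PR's SVE code and not the Faiss source. The struct name `MinimaxHeapSketch`, its `push` helper, and the convention that an id of -1 marks an empty slot are assumptions made for illustration, as is the tie-breaking rule (equal distances go to the highest index).

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for the heap state. Field names are assumptions
// modeled on the PR description (ids.data(), dis.data()), not Faiss itself.
struct MinimaxHeapSketch {
    std::vector<int> ids;   // assumed convention: -1 marks an empty slot
    std::vector<float> dis; // distance stored alongside each id
    int k = 0;              // number of slots in use (valid or emptied)
    int nvalid = 0;         // number of slots whose id is not -1

    // Illustrative helper: append one (id, distance) pair.
    void push(int id, float d) {
        assert(id != -1);
        ids.push_back(id);
        dis.push_back(d);
        ++k;
        ++nvalid;
    }

    // O(k) scan for the smallest distance among valid slots.
    // Returns the removed id, or -1 when no valid slot is left.
    int pop_min(float* vmin_out = nullptr) {
        int imin = -1;
        float vmin = std::numeric_limits<float>::infinity();
        for (int i = k - 1; i >= 0; --i) {
            if (ids[i] == -1) {
                continue; // skip emptied slots
            }
            // The imin == -1 test lets an infinite distance still be chosen.
            if (imin == -1 || dis[i] < vmin) {
                imin = i;
                vmin = dis[i];
            }
        }
        if (imin == -1) {
            return -1;
        }
        if (vmin_out != nullptr) {
            *vmin_out = vmin;
        }
        const int ret = ids[imin];
        ids[imin] = -1; // mark the slot as emptied
        --nvalid;
        return ret;
    }
};
```

The identical-distance and infinite-distance cases named in the unit tests below are the two edge cases this scan has to handle: a strict `<` comparison fixes which of several equal entries wins, and the "first valid slot" check keeps an all-infinity heap from returning -1 by mistake.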

The unit test for pop_min():

$ ./faiss_test --gtest_filter=HNSW.Test_popmin*
WARNING clustering 1000 points to 40 centroids: please provide at least 1560 training points
Running main() from /home/scratch.lyou_gpu/arm/workspaces/faiss-main/build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = HNSW.Test_popmin*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from HNSW
[ RUN ] HNSW.Test_popmin
[ OK ] HNSW.Test_popmin (0 ms)
[ RUN ] HNSW.Test_popmin_identical_distances
[ OK ] HNSW.Test_popmin_identical_distances (0 ms)
[ RUN ] HNSW.Test_popmin_infinite_distances
[ OK ] HNSW.Test_popmin_infinite_distances (0 ms)
[----------] 3 tests from HNSW (0 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (0 ms total)
[ PASSED ] 3 tests.

Performance Result:

Benchmark: cuVS bench (https://github.com/rapidsai/cuvs/tree/main/cpp/bench/ann)
Dataset: deep-96-image
Threads: 1 and 8
Test machine: NVIDIA Grace CPU

1 Thread

| Configuration | Baseline (ms) | Optimized (ms) | Speedup | Recall |
|---|---|---|---|---|
| M16.efConstruction128.efSearch16 | 0.1647 | 0.1643 | 1.002x | 0.717 |
| M16.efConstruction128.efSearch64 | 0.2858 | 0.2829 | 1.010x | 0.914 |
| M16.efConstruction128.efSearch256 | 0.7482 | 0.7220 | 1.036x | 0.982 |
| M16.efConstruction128.efSearch1024 | 2.9258 | 2.6881 | 1.088x | 0.996 |
| M32.efConstruction128.efSearch16 | 0.1812 | 0.1802 | 1.006x | 0.784 |
| M32.efConstruction128.efSearch64 | 0.3297 | 0.3254 | 1.013x | 0.940 |
| M32.efConstruction128.efSearch256 | 0.8822 | 0.8530 | 1.034x | 0.990 |
| M32.efConstruction128.efSearch1024 | 3.3204 | 3.0752 | 1.080x | 0.998 |
| M32.efConstruction256.efSearch64 | 0.3540 | 0.3498 | 1.012x | 0.954 |
| M32.efConstruction256.efSearch256 | 0.9627 | 0.9392 | 1.025x | 0.994 |

Summary (1 Thread)

  • Best speedup: 1.088x (M16.efConstruction128.efSearch1024)
  • Average speedup: ~1.031x (arithmetic mean of the ten rows; the median is ~1.019x)
  • Speedup range: 1.002x - 1.088x
  • Speedup grows with efSearch, from about 1.00x at efSearch16 up to 1.088x at efSearch1024
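
The summary figures can be rechecked from the table with a few lines of arithmetic. The sketch below takes speedup as baseline latency divided by optimized latency and reports both the arithmetic mean and the median, since "average" can mean either. The function and constant names are illustrative, and the latencies are copied from the 1-thread table above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Summary statistics over per-configuration speedups (illustrative type).
struct SpeedupStats {
    double min_speedup;
    double max_speedup;
    double mean_speedup;
    double median_speedup;
};

// speedup[i] = baseline_ms[i] / optimized_ms[i]
SpeedupStats speedup_stats(
        const std::vector<double>& baseline_ms,
        const std::vector<double>& optimized_ms) {
    assert(!baseline_ms.empty());
    assert(baseline_ms.size() == optimized_ms.size());

    std::vector<double> s;
    double sum = 0.0;
    for (std::size_t i = 0; i < baseline_ms.size(); ++i) {
        const double ratio = baseline_ms[i] / optimized_ms[i];
        s.push_back(ratio);
        sum += ratio;
    }
    std::sort(s.begin(), s.end());

    const std::size_t n = s.size();
    // For an even count the median is the mean of the two middle values.
    const double median =
            (n % 2 == 1) ? s[n / 2] : 0.5 * (s[n / 2 - 1] + s[n / 2]);
    return SpeedupStats{s.front(), s.back(), sum / n, median};
}

// 1-thread latencies in milliseconds, in table row order.
const std::vector<double> kBaselineMs1T = {
        0.1647, 0.2858, 0.7482, 2.9258, 0.1812,
        0.3297, 0.8822, 3.3204, 0.3540, 0.9627};
const std::vector<double> kOptimizedMs1T = {
        0.1643, 0.2829, 0.7220, 2.6881, 0.1802,
        0.3254, 0.8530, 3.0752, 0.3498, 0.9392};
```

Calling `speedup_stats(kBaselineMs1T, kOptimizedMs1T)` gives the four figures for the 1-thread run; swapping in the 8-thread columns gives the second summary.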

8 Threads

| Configuration | Baseline (ms) | Optimized (ms) | Speedup | Recall |
|---|---|---|---|---|
| M16.efConstruction128.efSearch16 | 0.0856 | 0.0855 | 1.001x | 0.714 |
| M16.efConstruction128.efSearch64 | 0.2157 | 0.2128 | 1.014x | 0.911 |
| M16.efConstruction128.efSearch256 | 0.7099 | 0.6857 | 1.035x | 0.982 |
| M16.efConstruction128.efSearch1024 | 2.9916 | 2.7619 | 1.083x | 0.997 |
| M32.efConstruction128.efSearch16 | 0.1047 | 0.1045 | 1.002x | 0.773 |
| M32.efConstruction128.efSearch64 | 0.2664 | 0.2633 | 1.012x | 0.940 |
| M32.efConstruction128.efSearch256 | 0.8598 | 0.8363 | 1.028x | 0.990 |
| M32.efConstruction128.efSearch1024 | 3.4262 | 3.1977 | 1.071x | 0.998 |
| M32.efConstruction256.efSearch64 | 0.2936 | 0.2911 | 1.008x | 0.955 |
| M32.efConstruction256.efSearch256 | 0.9519 | 0.9282 | 1.026x | 0.994 |

Summary (8 Threads)

  • Best speedup: 1.083x (M16.efConstruction128.efSearch1024)
  • Average speedup: ~1.028x (arithmetic mean of the ten rows; the median is ~1.020x)
  • Speedup range: 1.001x - 1.083x
  • Speedup grows with efSearch, from about 1.00x at efSearch16 up to 1.083x at efSearch1024

Signed-off-by: Lizhen You <[email protected]>
meta-cla bot commented Nov 25, 2025

Hi @LizYou!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

while (i < k_size) {
svbool_t pg_iter = svwhilelt_b32_u64(i, k_size);

const size_t prefetch_iterations = 2;
Contributor

why 2? please add a comment

Author

Thanks for the review! The value 2 gave the best performance in my benchmarks. The idea is to prefetch data a fixed number of iterations ahead (here, 2), so that the prefetch is issued neither too early nor too late for the data to be in cache when it is accessed. I will add a comment explaining this.
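
The prefetch-ahead idea in this reply can be sketched in portable C++ as follows. This is not the PR's SVE implementation: the inner scalar loop stands in for one predicated vector step, `lanes` is a plain parameter standing in for the hardware vector length in 32-bit elements, and `argmin_with_prefetch` and `SKETCH_PREFETCH` are hypothetical names. The macro uses `__builtin_prefetch` where GCC or Clang provides it and does nothing elsewhere. Unlike a backward scan, this sketch resolves ties to the lowest index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Prefetch hint on GCC/Clang; a no-op on other compilers.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREFETCH(addr) __builtin_prefetch((addr), 0, 3)
#else
#define SKETCH_PREFETCH(addr) ((void)(addr))
#endif

// Returns the index of the smallest distance among slots whose id is not -1,
// or -1 if there is none. Each outer iteration handles `lanes` elements and
// hints the cache about the chunk `prefetch_iterations` chunks ahead.
inline int argmin_with_prefetch(
        const int* ids,
        const float* dis,
        std::size_t k_size,
        std::size_t lanes,
        std::size_t prefetch_iterations = 2) {
    assert(lanes > 0);
    int imin = -1;
    float vmin = 0.0f;
    for (std::size_t i = 0; i < k_size; i += lanes) {
        // Look ahead by a fixed number of chunks, only while still in range.
        const std::size_t prefetch_idx = i + prefetch_iterations * lanes;
        if (prefetch_idx < k_size) {
            SKETCH_PREFETCH(ids + prefetch_idx);
            SKETCH_PREFETCH(dis + prefetch_idx);
        }
        // Scalar stand-in for one vector step over [i, min(i + lanes, k_size)).
        const std::size_t end = std::min(i + lanes, k_size);
        for (std::size_t j = i; j < end; ++j) {
            if (ids[j] != -1 && (imin == -1 || dis[j] < vmin)) {
                imin = static_cast<int>(j);
                vmin = dis[j];
            }
        }
    }
    return imin;
}
```

The lookahead distance is a tuning knob: too small and the data has not arrived by the time it is read, too large and it may be evicted before use. The best value depends on the machine and workload, which is consistent with the author picking 2 by measurement.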


const size_t prefetch_iterations = 2;
size_t prefetch_idx = i + prefetch_iterations * lanes;
if (prefetch_idx < k_size) {
Contributor

is this if really needed?

Author

This check avoids prefetching out-of-bounds addresses. Indices in the current chunk starting at i are safe inside the loop, but we prefetch at i + 2 * lanes, which can go past the loop's upper bound and waste CPU cycles on a useless prefetch. Let me know if you still think we should remove the check.
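
To make the effect of the guard concrete, the sketch below lists which prefetch indices pass the `prefetch_idx < k_size` test for a given array size. The function name `guarded_prefetch_indices` is hypothetical, and the loop shape (start at 0, advance by `lanes`) is an assumption based on the excerpt above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Collects every prefetch index that the bounds guard lets through.
std::vector<std::size_t> guarded_prefetch_indices(
        std::size_t k_size,
        std::size_t lanes,
        std::size_t prefetch_iterations) {
    assert(lanes > 0);
    std::vector<std::size_t> issued;
    for (std::size_t i = 0; i < k_size; i += lanes) {
        const std::size_t prefetch_idx = i + prefetch_iterations * lanes;
        if (prefetch_idx < k_size) {
            issued.push_back(prefetch_idx); // a prefetch would be issued here
        }
    }
    return issued;
}
```

With `k_size = 20`, `lanes = 4` and a lookahead of 2, the loop runs at i = 0, 4, 8, 12, 16 and the guard admits only the indices 8, 12 and 16. The last two iterations would target 20 and 24, which are past the end, so they issue nothing.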

@alexanderguzhva
Contributor

overall, lgtm
