Conversation

@marma01 marma01 commented Nov 20, 2025

Optimized fvec_inner_product for AArch64 with NEON intrinsics and an 8-way unrolled loop.

Benchmarks (HNSW_IP index build on GIST1M) show ~37% faster build time on AWS m8g.16xlarge (Graviton4).

                    Before         After          Uplift
Build time (ms)     381,110.832    239,374.388    37%
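As a quick sanity check, the reported uplift follows directly from the two build times in the table (a sketch; the variable names are illustrative, not from the PR):

```python
# Build times from the benchmark table above (milliseconds)
before_ms = 381_110.832  # baseline build time
after_ms = 239_374.388   # build time with the 8-way unrolled NEON kernel

# Relative reduction in build time
uplift = (before_ms - after_ms) / before_ms
print(f"uplift: {uplift:.1%}")  # ~37.2%, consistent with the reported ~37%
```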

Benchmark script (measures both build and search time)

import time
import sys
import numpy as np
import faiss

try:
    from faiss.contrib.datasets_fb import DatasetGIST1M
except ImportError:
    from faiss.contrib.datasets import DatasetGIST1M

k = 10

print("load data")

ds = DatasetGIST1M()

xq = ds.get_queries()
xb = ds.get_database()
gt = ds.get_groundtruth()
xt = ds.get_train()

nq, d = xq.shape

def evaluate(index):

    t0 = time.time()
    D, I = index.search(xq, k)
    t1 = time.time()

    missing_rate = (I == -1).sum() / float(k * nq)
    recall_at_1 = (I == gt[:, :1]).sum() / float(nq)
    print("\t %7.3f ms per query, R@1 %.4f, missing rate %.4f" % (
        (t1 - t0) * 1000.0 / nq, recall_at_1, missing_rate))

print("Testing HNSW Flat (Inner Product)")

# Regenerate IP groundtruth
print("Regenerate Inner Product groundtruth...")
gt_index = faiss.IndexFlatIP(d)
gt_index.add(xb)
_, gt = gt_index.search(xq, k)

index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.hnsw.efConstruction = 40

print("add ")
index.verbose = True
index.add(xb)

print("search")
for efSearch in (16, 32, 64, 128, 256):
    for bounded_queue in (True, False):
        print("efSearch", efSearch, "bounded queue", bounded_queue, end=' ')
        index.hnsw.search_bounded_queue = bounded_queue
        index.hnsw.efSearch = efSearch
        evaluate(index)

@meta-cla meta-cla bot added the CLA Signed label Nov 20, 2025
float32x4_t tmp2 = vaddq_f32(sum[4], sum[5]);
float32x4_t tmp3 = vaddq_f32(sum[6], sum[7]);

float32x4_t total = vaddq_f32(vaddq_f32(tmp0, tmp1), vaddq_f32(tmp2, tmp3));
A reviewer (Contributor) commented on the reduction code above:
This code is suboptimal for d < 32, which happens inside faiss, because many of the vaddq_f32 operations are wasted in that case. Please add more careful handling of when to enable this manual loop unrolling.

Also, could you please confirm that a modern compiler (not something like GCC 9) really generates worse code than this hand-written version? My impression is that a modern compiler can optimize a dot-product computation quite well nowadays.
Thanks
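The reviewer's concern can be sketched in plain Python: an 8-way unrolled kernel consumes 32 floats per iteration (8 accumulators x 4 lanes), so for d < 32 the unrolled body never executes and the accumulator reductions are pure overhead. A dimension check keeps the simple loop for small d. This is an illustrative sketch of the technique, not the faiss implementation:

```python
def inner_product(x, y):
    """Illustrative scalar fallback for small dimensions."""
    return sum(a * b for a, b in zip(x, y))

def inner_product_unrolled(x, y, lanes=4, ways=8):
    """Sketch of an 8-way unrolled kernel: 8 independent accumulators,
    each standing in for one float32x4_t NEON register (4 lanes)."""
    d = len(x)
    step = lanes * ways             # 32 floats consumed per iteration
    if d < step:                    # the check the reviewer asked for:
        return inner_product(x, y)  # skip useless reductions when d < 32
    acc = [0.0] * ways              # independent accumulators -> more ILP
    i = 0
    while i + step <= d:
        for w in range(ways):
            base = i + w * lanes
            for lane in range(lanes):
                acc[w] += x[base + lane] * y[base + lane]
        i += step
    total = sum(acc)                # horizontal reduction (the vaddq_f32 tree)
    total += inner_product(x[i:], y[i:])  # scalar tail for the d % 32 remainder
    return total
```

With the guard in place, small vectors never pay for the reduction tree, while large vectors still benefit from the independent accumulators.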

marma01 (Author) replied:

Thanks, I will add a check to enable manual loop unrolling only when d >= 32.
Regarding the compiler: yes, modern compilers do generate SIMD instructions, but they typically don’t apply deep unrolling or use multiple accumulators the way the hand-written 8-way NEON version does, which limits instruction-level parallelism and throughput.
For reference, here is the generated assembly from GCC 14 and Clang 19: https://godbolt.org/z/P9zPTPasa
I also ran micro-benchmarks comparing the original code against the manual NEON version; the manual NEON code achieved better performance, with the uplift increasing as the dimension grows.
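For illustration only, a micro-benchmark in this spirit can be sketched in Python, using a naive scalar loop versus np.dot as a stand-in for a vectorized kernel (this is not the benchmark used in the PR, and absolute numbers are machine-dependent):

```python
import time
import numpy as np

def naive_dot(x, y):
    # Scalar reference: one multiply-add per element, single accumulator
    return sum(float(a) * float(b) for a, b in zip(x, y))

def bench(fn, x, y, iters=200):
    # Average wall-clock time per call over `iters` repetitions
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x, y)
    return (time.perf_counter() - t0) / iters

rng = np.random.default_rng(0)
for d in (16, 64, 256, 960):  # 960 is the GIST1M dimensionality
    x = rng.standard_normal(d, dtype=np.float32)
    y = rng.standard_normal(d, dtype=np.float32)
    t_naive = bench(naive_dot, x, y)
    t_vec = bench(np.dot, x, y)
    print(f"d={d:4d}  scalar {t_naive * 1e6:8.2f} us  np.dot {t_vec * 1e6:8.2f} us")
```

On typical hardware the vectorized path wins by a margin that grows with d, mirroring the author's observation that the uplift increases with dimension.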

- Add a check to enable manual loop unrolling only when d >= 32.
- Format code according to clang-format
@marma01 marma01 force-pushed the optimize_fvec_inner_product branch from 9405576 to 5c2554d on November 25, 2025 at 09:49
@marma01 marma01 (Author) commented Nov 25, 2025

Updated the code to:

  • Add a check to enable manual loop unrolling only when d >= 32
  • Format code according to clang-format

@pankajsingh88 pankajsingh88 (Contributor) commented:

@subhadeepkaran will this be blocked on internal SIMD optimization related refactors?

@subhadeepkaran subhadeepkaran replied, quoting:
@subhadeepkaran will this be blocked on internal SIMD optimization related refactors?

Yep, this file has also undergone a bunch of changes as part of dynamic dispatch, so for now we can hold this PR; as soon as we commit the DD changes, we can rebase and review it.
