Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement cosine for faiss engine #2242

Open
jmazanec15 opened this issue Oct 31, 2024 · 7 comments
Open

Implement cosine for faiss engine #2242

jmazanec15 opened this issue Oct 31, 2024 · 7 comments
Assignees
Labels
Enhancements Increases software capabilities beyond original client specifications Roadmap:Vector Database/GenAI Project-wide roadmap label v2.19.0

Comments

@jmazanec15
Copy link
Member

Description

Cosine similarity is one of the more popular space types. faiss does not support it directly. Instead, they prefer to have data be normalized and then use the inner product (which is equivalent to cosine --> <u,v>/||u||||v|| = cos(theta) (https://github.com/facebookresearch/faiss/blob/2c961cc308ade8a85b3aa10a550728ce3387f625/README.md?plain=1#L11). We should figure out how to add cosine for faiss now that it is default (#2163)

We have a couple different options:

  1. We could normalize data before serializing to flat vector values
  2. We could just normalize when passing to faiss and then for exact scoring lazily compute norms and normalize
  3. We could compute the norm and then use it to divide the dot product when scoring is needed or when vectors need to be normalized.

I prefer option 3 mainly because it will allow us to use use knn vector values as synthetic source (see #1571), but also let us be efficient on search.

We would need to investigate how best to do this. One simple way to do it would be to add one extra dimension and store the normalized value there.

@jmazanec15 jmazanec15 added the Enhancements Increases software capabilities beyond original client specifications label Oct 31, 2024
@luyuncheng
Copy link
Collaborator

luyuncheng commented Oct 31, 2024

we added an ingest pipeline to normalize the data for faiss engine who want to Cosine similarity

@heemin32
Copy link
Collaborator

I prefer option 2. It is more simple as we keep everything as is and only pass normalized data to faiss engine. Could you tell why we need exact scoring lazily compute norms and normalize for option 2?

@jmazanec15
Copy link
Member Author

For option 2, for re-scoring (exact search), because the vectors are not normalized, we would need to compute norms, as we do now.

@vamshin
Copy link
Member

vamshin commented Oct 31, 2024

we added an ingest pipeline to normalize the data for faiss engine who want to Cosine similarity

@luyuncheng interesting. What are your thoughts on having this support in k-NN plugin VS Ingest pipeline approach?

@heemin32
Copy link
Collaborator

For option 2, for re-scoring (exact search), because the vectors are not normalized, we would need to compute norms, as we do now.

For re-scoring we can just use cosine distance calculation instead of (normalization + innerproduct)?

@jmazanec15
Copy link
Member Author

cosine distance calculation is implemented as normalization + innerproduct I believe

@navneet1v navneet1v moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Oct 31, 2024
@luyuncheng
Copy link
Collaborator

luyuncheng commented Nov 1, 2024

we added an ingest pipeline to normalize the data for faiss engine who want to Cosine similarity

@luyuncheng interesting. What are your thoughts on having this support in k-NN plugin VS Ingest pipeline approach?

@vamshin @jmazanec15 @heemin32 as @jmazanec15 says

cosine distance calculation is implemented as normalization + innerproduct I believe

cosine distance calculation is normalization + innerproduct also says in faiss#wiki

before i read this issues, and i do need cosine distance in faiss.

firstly i introduced normalization at faiss_wrapper#CreateIndex and do normalize before faiss::write_index like @jmazanec15 's option 1. then i found, every merge would do normalize again. it increased the refresh time.

so in the end, i introduced a new pipeline doing normalize,

PROS:

  1. it can do normalize at ingestNode, save cpu usage at datanode to build Graph
  2. not change jni code.
  3. not increase refresh time.
  4. compare to option2 option3, not increase search time

CONS:

  1. changed raw vector data
  2. need more ingestNode computation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancements Increases software capabilities beyond original client specifications Roadmap:Vector Database/GenAI Project-wide roadmap label v2.19.0
Projects
Status: New
Status: Backlog (Hot)
Development

No branches or pull requests

5 participants