Introduce blog post for disk-based k-NN #3616

Merged: 17 commits into opensearch-project:main, Feb 19, 2025

Conversation

jmazanec15
Member

Description

Adds a blog post on disk-based vector search.

Issues Resolved

#3615

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.

Adds a blog post for disk-based k-NN. Included is a set of results and
images.

Signed-off-by: John Mazanec <[email protected]>

| Metric/Configuration | in-memory | on_disk_8x | in_memory_8x | on_disk_16x | in_memory_16x | on_disk_32x | in_memory_32x |
|-----------------------------------|-----------|------------|--------------|-------------|---------------|-------------|---------------|
| recall@100 (ratio) | 0.95 | 0.98 | 0.98 | 0.97 | 0.96 | 0.94 | 0.95 |
Contributor

This shows that 32x compression just works, so we should not add on_disk here.

Member Author

Added both for the sake of transparency. For some datasets, rescoring does not significantly help.
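
To make the rescoring trade-off in this thread concrete, here is a minimal Python sketch. It is not the k-NN plugin's implementation: it assumes 1-bit sign quantization as a stand-in for 32x compression, a brute-force Hamming scan in place of the ANN index, and random vectors in place of a real dataset. All function names are hypothetical.

```python
import random


def binarize(vec):
    """1-bit quantization: keep only the sign of each dimension."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def dot(a, b):
    """Full-precision inner product."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query, vectors, k):
    """Ground truth: rank every vector by full-precision inner product."""
    order = sorted(range(len(vectors)), key=lambda i: -dot(query, vectors[i]))
    return order[:k]


def quantized_top_k(query, vectors, codes, k, oversample=1.0, rescore=False):
    """Phase 1: rank by Hamming distance on the binary codes and keep
    k * oversample candidates. Phase 2 (optional): re-rank those candidates
    with the full-precision vectors, which is the step that would read
    from disk in an on-disk setup."""
    q_code = binarize(query)
    n_candidates = max(k, int(k * oversample))
    candidates = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))
    candidates = candidates[:n_candidates]
    if rescore:
        candidates.sort(key=lambda i: -dot(query, vectors[i]))
    return candidates[:k]


def recall_at_k(found, truth):
    """Fraction of the true top-k neighbors that were returned."""
    return len(set(found) & set(truth)) / len(truth)


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n, k = 64, 2000, 10
    vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    codes = [binarize(v) for v in vectors]
    query = [rng.gauss(0, 1) for _ in range(dim)]
    truth = exact_top_k(query, vectors, k)
    for rescore in (False, True):
        found = quantized_top_k(query, vectors, codes, k, oversample=3.0, rescore=rescore)
        print(f"rescore={rescore}: recall@{k} = {recall_at_k(found, truth):.2f}")
```

In this sketch, rescoring can only help when the quantized ranking pushes true neighbors out of the top k but still keeps them among the oversampled candidates. If the quantized ranking is already close to the full-precision one, the second phase adds cost without much recall gain, which matches the point made above.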

@kolchfa-aws kolchfa-aws self-assigned this Feb 4, 2025
Signed-off-by: Fanit Kolchina <[email protected]>
Co-authored-by: John Mazanec <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Collaborator

@natebower natebower left a comment

@kolchfa-aws @jmazanec15 Editorial review complete. Please see my comments and changes and let me know if you have any questions. Thanks!

Cc: @pajuric

Interestingly, for this dataset, the on-disk approach with rescoring produces similar recall to the in-memory approach without rescoring, but the in-memory approach is substantially faster. This is most likely because the Cohere v3 model has been optimized to work very well with binary quantized data (see [this blog post](https://cohere.com/blog/int8-binary-embeddings)).
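
For readers who want to try the two setups compared here, the following Python sketch builds the request bodies. It assumes the `mode`, `compression_level`, and `rescore.oversample_factor` parameters as I understand them from the OpenSearch vector search documentation (2.17 and later); the field name, dimension, and oversample factor are placeholders, so check the parameter names against the documentation for your version before relying on them.

```python
import json


def knn_index_body(mode, compression_level, dimension=768, field="my_vector_field"):
    """Build an index-creation body for a knn_vector field.

    mode: "on_disk" or "in_memory" (assumed parameter values).
    compression_level: for example "8x", "16x", or "32x" (assumed values).
    dimension and field are placeholders, not values from the blog post.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "space_type": "innerproduct",
                    "mode": mode,
                    "compression_level": compression_level,
                }
            }
        },
    }


def knn_query_body(vector, k, oversample_factor=None, field="my_vector_field"):
    """Build a k-NN query body, optionally requesting explicit rescoring."""
    knn_clause = {"vector": vector, "k": k}
    if oversample_factor is not None:
        # Assumed parameter: retrieve k * oversample_factor candidates,
        # then re-rank them with full-precision vectors.
        knn_clause["rescore"] = {"oversample_factor": oversample_factor}
    return {"size": k, "query": {"knn": {field: knn_clause}}}


if __name__ == "__main__":
    # Roughly the on_disk_32x and in_memory_32x configurations from the table.
    for mode in ("on_disk", "in_memory"):
        print(json.dumps(knn_index_body(mode, "32x"), indent=2))
    print(json.dumps(knn_query_body([0.1] * 4, k=100, oversample_factor=2.0), indent=2))
```

The bodies can be sent with any HTTP client, for example as `PUT <index-name>` and `GET <index-name>/_search`.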

## Learnings

Collaborator

Line 260: We haven't actually referenced ANN prior to this. Instead of "ANN approach", do we mean "nearest neighbor approach"?

Member Author

Approximate nearest neighbor search

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
@kolchfa-aws
Collaborator

@pajuric Could you please edit the meta for this blog? After that, it will be ready to publish. Thanks!

@kolchfa-aws kolchfa-aws removed their assignment Feb 17, 2025
@jmazanec15
Member Author

@pajuric @kolchfa-aws Are there any to-dos left on this one, or can we publish?

@kolchfa-aws
Collaborator

@jmazanec15 We can publish after @pajuric updates the meta.

Signed-off-by: John Mazanec <[email protected]>
@jmazanec15
Member Author

Thanks @pajuric - updated

@pajuric

pajuric commented Feb 19, 2025

@nateynateynate @krisfreedain - New blog to publish today

Member

@nateynateynate nateynateynate left a comment

Looks good!

@nateynateynate nateynateynate merged commit 9875b02 into opensearch-project:main Feb 19, 2025
4 checks passed
6 participants