Support for matryoshka indexing #131

Open
npip99 opened this issue Sep 5, 2024 · 0 comments

CREATE INDEX ix_chunk_embedding
ON chunk USING diskann (embedding) WITH (num_dimensions=1999);
NOTICE:  Starting index build. num_neighbors=-1 search_list_size=100, max_alpha=1.2, storage_layout=SbqCompression
ERROR:  assertion failed: dimensions > 0 && dimensions < 2000

The error above is a bit of a shame.

I understand putting a hard limit on the index dimension, since it changes the entire search process. But if my vector is a Vector(3072), it would be nice to support matryoshka embeddings by allowing the index dimension to be < 2000 even when the source vector has a larger dimension. I believe the SQL above should execute successfully, since I'm only indexing a subvector of the original vector.

For now, I have a generated column computed from my desired subvector, but this takes physical space on disk when ideally it would be computed on the fly. It also means that I have to rerank manually by the full vector, rather than the index handling it automatically (not a big deal).
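A minimal sketch of that workaround, for anyone in the same spot. It assumes pgvector 0.7.0 or later (for `subvector`), an `id` column on `chunk`, and cosine distance; the column name `embedding_head`, the 1024-dimension prefix, and the candidate count of 100 are illustrative choices, not anything pgvectorscale prescribes.

```sql
-- Assumption: chunk(id, embedding vector(3072)); embedding_head is a hypothetical name.
-- Store the first 1024 dimensions as a generated column (this is the part that costs disk space).
ALTER TABLE chunk
  ADD COLUMN embedding_head vector(1024)
  GENERATED ALWAYS AS (subvector(embedding, 1, 1024)::vector(1024)) STORED;

-- Index the truncated column, which stays under the 2000-dimension limit.
CREATE INDEX ix_chunk_embedding_head
  ON chunk USING diskann (embedding_head vector_cosine_ops);

-- Manual two-stage search: $1 is the full 3072-dimension query vector.
SELECT id
FROM (
  -- Stage 1: approximate search on the 1024-dimension prefix.
  SELECT id, embedding
  FROM chunk
  ORDER BY embedding_head <=> subvector($1::vector(3072), 1, 1024)::vector(1024)
  LIMIT 100
) AS candidates
-- Stage 2: rerank the candidates by the full vector.
ORDER BY embedding <=> $1::vector(3072)
LIMIT 10;
```

Cosine distance is scale-invariant, so the truncated prefix does not need to be renormalized here; with inner-product or L2 distance, the prefix would likely need normalizing first.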

If it could support e.g. this notation, then the num_dimensions attribute wouldn't be necessary anymore, and that would solve both problems (though I think supporting that notation might be overkill; I'm not sure).
