Support for matryoshka indexing #131

npip99 · 2024-09-05T21:50:47Z

CREATE INDEX ix_chunk_embedding
ON chunk USING diskann (embedding) WITH (num_dimensions=1999);

NOTICE:  Starting index build. num_neighbors=-1 search_list_size=100, max_alpha=1.2, storage_layout=SbqCompression
ERROR:  assertion failed: dimensions > 0 && dimensions < 2000

The error above is a bit of a shame.

I understand putting a hard limit on the index dimension, since it totally changes the entire search process. But if my vector is a Vector(3072), it would be nice to support matryoshka embeddings by allowing the dimension of the index to be < 2000, even if the source vector has a larger dimension. I believe the above SQL code should execute successfully, since I'm only indexing a subvector of the original vector.

For now, I use a generated column that stores my desired subvector, but this takes physical space on disk when ideally it would be computed on the fly. It also means that I have to rerank manually by the full vector, rather than the index handling it automatically (not a big deal).
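For reference, the workaround looks roughly like the sketch below. It assumes pgvector's `subvector(vector, start, count)` function (available in recent pgvector releases) and cosine distance (`<=>`). The column name `embedding_sub`, the 1024-dimension prefix, the candidate count of 100, and the `$1` query parameter are placeholders for illustration, not my exact setup.

```sql
-- Hypothetical names: embedding_sub, the 1024-dim prefix, and $1 are assumptions.
-- Store the leading 1024 dimensions of the 3072-dim embedding as a generated column.
ALTER TABLE chunk
    ADD COLUMN embedding_sub vector(1024)
    GENERATED ALWAYS AS (subvector(embedding, 1, 1024)::vector(1024)) STORED;

-- Index the shorter column, which stays under the 2000-dimension limit.
CREATE INDEX ix_chunk_embedding_sub
ON chunk USING diskann (embedding_sub);

-- Manual rerank: fetch candidates by the subvector (index-assisted),
-- then reorder them by distance on the full 3072-dim vector.
-- $1 is the full query embedding, passed as a vector(3072) parameter.
SELECT *
FROM (
    SELECT *
    FROM chunk
    ORDER BY embedding_sub <=> subvector($1::vector(3072), 1, 1024)::vector(1024)
    LIMIT 100
) AS candidates
ORDER BY embedding <=> $1::vector(3072)
LIMIT 10;
```

The stored column is what costs the extra disk space, and the outer `ORDER BY` is the manual rerank step.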

If it could support e.g. this notation, then the num_dimensions attribute wouldn't be necessary anymore, and both problems would be solved (but supporting that notation might be overkill, I'm not sure).
