Description
In my opinion, sparse vectors can solve two different problems:
- Text search (TFIDF, BM25, SPLADE, etc.)
- Weighted-keywords search
In both cases, we can have a sparse vector representation:
- Text:
{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}
- Weighted-keywords:
{"Dog": 0.4, "Cat": 0.3, "Giant panda": 0.1, "Komodo dragon": 0.05}
However, the use cases are different:
Text Search
For text search, sparse vectors like BM25 are a good representation. For example, if I look for the query:
BM25 vs SPLADE
The tokens will be:
["BM", "25", "vs", "SPLADE"]
It is acceptable to return documents that contain a subset of these tokens, such as:
{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}
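To make the text-search case concrete, here is a minimal sketch of how such a document could be scored against the query. It assumes sparse vectors are stored as plain token-to-weight dicts and that every query token has weight 1.0; a real BM25 or SPLADE encoder would assign its own query weights. The helper name `sparse_dot` is hypothetical.

```python
def sparse_dot(query, doc):
    """Inner product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict; tokens missing from the other side contribute 0.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(weight * doc[token] for token, weight in query.items() if token in doc)


# Assumption: uniform query weights of 1.0, for illustration only.
query = {"BM": 1.0, "25": 1.0, "vs": 1.0, "SPLADE": 1.0}
doc = {"What": 0.10430284, "is": 0.10090457, "BM": 0.2635918, "25": 0.3382988, "?": 0.052101523}

# Only "BM" and "25" overlap, so the document still gets a non-zero score
# even though "vs" and "SPLADE" are absent.
score = sparse_dot(query, doc)
print(score)
```

The point is that a partial overlap is enough for the document to be returned and ranked, which is the behavior wanted for text search.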
Weighted-Keywords Search
Now imagine that I have a collection of documents (images or texts), each annotated with animal keywords and their corresponding probabilities:
DOC A: {"Dog": 0.43}
DOC B: {"Cat": 0.21}
DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC D: {"Giant panda": 0.1}
DOC E: {"Dog": 0.33, "Cat": 0.66}
If I perform a query with the keyword ["Dog"], I want the following results (sorted by inner product, highest first):
DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC A: {"Dog": 0.43}
DOC E: {"Dog": 0.33, "Cat": 0.66}
However, if I query with the keywords ["Dog", "Cat"], it is not acceptable to return documents that contain only a subset of these keywords: I want only documents that contain all of them (similar to the PostgreSQL @> operator). The ranking should then be done by inner product:
DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC E: {"Dog": 0.33, "Cat": 0.66}
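The desired behavior can be sketched in a few lines of Python: first filter to documents that contain every query keyword, then rank the survivors by inner product. This is only an illustration of the semantics, not pgvector code. The function name `search_all_keywords` is hypothetical, and the query is assumed to weight each keyword equally (1.0), which the description above does not specify.

```python
docs = {
    "DOC A": {"Dog": 0.43},
    "DOC B": {"Cat": 0.21},
    "DOC C": {"Dog": 0.65, "Cat": 0.11},
    "DOC D": {"Giant panda": 0.1},
    "DOC E": {"Dog": 0.33, "Cat": 0.66},
}


def search_all_keywords(docs, keywords):
    """Return (doc_id, score) pairs for docs containing every keyword, best first."""
    required = set(keywords)
    hits = []
    for doc_id, vec in docs.items():
        # Containment filter (AND between keywords), analogous to @> on arrays.
        if required.issubset(vec):
            # Assumption: unit query weights, so the inner product is a plain sum.
            score = sum(vec[k] for k in required)
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


print(search_all_keywords(docs, ["Dog"]))
print(search_all_keywords(docs, ["Dog", "Cat"]))
```

Both queries return the expected sets of documents. Note that the relative order of DOC C and DOC E for ["Dog", "Cat"] depends on the query weights: under the equal-weight assumption used here, DOC E (0.33 + 0.66) scores higher than DOC C (0.65 + 0.11).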
Final Notes
Plain (unweighted) keyword search can be done in PostgreSQL with the following array operators:
WHERE query_keywords && ARRAY['keyword1', 'keyword2', 'keyword3']; ---- OR between keywords
WHERE query_keywords @> ARRAY['keyword1', 'keyword2', 'keyword3']; ---- AND between keywords
These queries can also be accelerated in PostgreSQL with an inverted index: GIN.
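For readers unfamiliar with inverted indexes, the idea can be sketched as a map from each keyword to the documents that contain it (a posting list). The toy Python version below is not how GIN is implemented internally; it only shows why the OR (&&) and AND (@>) filters become cheap set operations. All names are hypothetical, and the weights are kept in the posting lists so that a ranking step could follow.

```python
from collections import defaultdict

docs = {
    "DOC A": {"Dog": 0.43},
    "DOC B": {"Cat": 0.21},
    "DOC C": {"Dog": 0.65, "Cat": 0.11},
    "DOC D": {"Giant panda": 0.1},
    "DOC E": {"Dog": 0.33, "Cat": 0.66},
}


def build_inverted_index(docs):
    """Map each keyword to a posting list {doc_id: weight}."""
    index = defaultdict(dict)
    for doc_id, vec in docs.items():
        for keyword, weight in vec.items():
            index[keyword][doc_id] = weight
    return index


def docs_with_any(index, keywords):
    """OR between keywords (like &&): union of the posting lists."""
    result = set()
    for keyword in keywords:
        result.update(index.get(keyword, {}))
    return result


def docs_with_all(index, keywords):
    """AND between keywords (like @>): intersection of the posting lists."""
    postings = sorted((index.get(k, {}) for k in keywords), key=len)
    if not postings:
        return set()
    result = set(postings[0])  # start from the shortest posting list
    for posting in postings[1:]:
        result.intersection_update(posting)
    return result


index = build_inverted_index(docs)
print(sorted(docs_with_any(index, ["Dog", "Cat"])))
print(sorted(docs_with_all(index, ["Dog", "Cat"])))
```

Only the posting lists for the query keywords are touched, so documents such as DOC D are never visited for a ["Dog", "Cat"] query.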
My final question is whether we can achieve this weighted-keywords search in pgvector, perhaps with new operators like && and @>.
Some other vector databases, like Milvus, already include Inverted Indexes for dealing with sparse vectors: