Sparse vectors: 1 representation, 2 use cases #587

Open
@javiabellan

In my opinion, sparse vectors can solve two different problems:

  1. Text search (TF-IDF, BM25, SPLADE, etc.)
  2. Weighted-keywords search

In both cases, we can have a sparse vector representation:

  1. Text: {'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}
  2. Weighted-keywords: {"Dog": 0.4, "Cat": 0.3, "Giant panda": 0.1, "Komodo dragon": 0.05}

However, the use cases are different:

Text Search

For text search, sparse vectors such as those produced by BM25 are a good representation. For example, take the query:

BM25 vs SPLADE

It is split into these tokens:

["BM", "25", "vs", "SPLADE"]

It is acceptable to return documents that contain a subset of these tokens, such as:

{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}
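A minimal sketch of this "subset is fine" behaviour, assuming sparse vectors are stored as plain `{token: weight}` dicts. The uniform query weights, the `sparse_dot` helper, and the second document are hypothetical and only there for illustration; a real BM25 or SPLADE encoder would supply its own weights.

```python
def sparse_dot(query, doc):
    """Inner product of two sparse vectors stored as {token: weight} dicts."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)

# Hypothetical uniform query weights for the tokens of "BM25 vs SPLADE".
query = {"BM": 1.0, "25": 1.0, "vs": 1.0, "SPLADE": 1.0}

docs = {
    # Document vector taken from the example above.
    "doc_bm25": {"What": 0.10430284, "is": 0.10090457, "BM": 0.2635918,
                 "25": 0.3382988, "?": 0.052101523},
    # Hypothetical document sharing no token with the query.
    "doc_unrelated": {"Giant": 0.5, "panda": 0.5},
}

# OR semantics: a document matching only some query tokens still gets a score.
scores = {name: sparse_dot(query, vec) for name, vec in docs.items()}
ranked = sorted((n for n, s in scores.items() if s > 0),
                key=scores.get, reverse=True)
print(ranked, scores)
```

Here `doc_bm25` matches only "BM" and "25" and is still returned, which is the desired behaviour for text search.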

Weighted-Keywords Search

Now imagine that I have a collection of documents (images or texts), each annotated with animal keywords and their corresponding probabilities:

  • DOC A: {"Dog": 0.43}
  • DOC B: {"Cat": 0.21}
  • DOC C: {"Dog": 0.65, "Cat": 0.11}
  • DOC D: {"Giant panda": 0.1}
  • DOC E: {"Dog": 0.33, "Cat": 0.66}

If I perform a query with the keyword ["Dog"], I want the following results (sorted by inner product, highest first):

  1. DOC C: {"Dog": 0.65, "Cat": 0.11}
  2. DOC A: {"Dog": 0.43}
  3. DOC E: {"Dog": 0.33, "Cat": 0.66}

However, if I query with the keywords ["Dog", "Cat"], it is not acceptable to return documents that contain only a subset of these keywords: I want only documents containing all of the keywords (similar to the PostgreSQL @> operator). The matching documents should then be ranked by inner product, highest first:

  1. DOC E: {"Dog": 0.33, "Cat": 0.66}  (0.33 + 0.66 = 0.99, with equal query weights)
  2. DOC C: {"Dog": 0.65, "Cat": 0.11}  (0.65 + 0.11 = 0.76, with equal query weights)
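The desired behaviour can be sketched in plain Python as "filter by containment, then rank by inner product". This is only an illustration of the semantics, not a pgvector API: `search_all` and its `weights` parameter are hypothetical names, and query weights default to 1.0 per keyword as an assumption.

```python
docs = {
    "A": {"Dog": 0.43},
    "B": {"Cat": 0.21},
    "C": {"Dog": 0.65, "Cat": 0.11},
    "D": {"Giant panda": 0.1},
    "E": {"Dog": 0.33, "Cat": 0.66},
}

def search_all(keywords, docs, weights=None):
    """Keep docs containing ALL keywords, then rank by inner product (descending)."""
    # Assumption: every query keyword has weight 1.0 unless weights are given.
    weights = weights or {k: 1.0 for k in keywords}
    hits = []
    for name, vec in docs.items():
        if all(k in vec for k in keywords):  # AND filter, like @>
            score = sum(weights[k] * vec[k] for k in keywords)
            hits.append((name, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)

print(search_all(["Dog"], docs))
print(search_all(["Dog", "Cat"], docs))
```

With equal query weights the two-keyword query scores DOC E at about 0.99 and DOC C at about 0.76. Non-uniform query weights can change that order, for example `weights={"Dog": 1.0, "Cat": 0.1}` puts DOC C first.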

Final Notes

Plain (unweighted) keyword search in PostgreSQL can be done with the array operators && and @>:

WHERE query_keywords && ARRAY['keyword1', 'keyword2', 'keyword3'];  -- OR: rows sharing at least one keyword
WHERE query_keywords @> ARRAY['keyword1', 'keyword2', 'keyword3'];  -- AND: rows containing every keyword

These array queries can also be accelerated in PostgreSQL with an inverted index: GIN.
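To make the inverted-index idea concrete, here is a toy sketch in Python. It is not how GIN is implemented; it only shows why a keyword-to-documents mapping makes OR and AND lookups cheap. The function names `build_inverted_index`, `match_any`, and `match_all` are hypothetical.

```python
from collections import defaultdict

# Keyword lists per document (weights dropped; only membership matters here).
doc_keywords = {
    "A": ["Dog"],
    "B": ["Cat"],
    "C": ["Dog", "Cat"],
    "D": ["Giant panda"],
    "E": ["Dog", "Cat"],
}

def build_inverted_index(doc_keywords):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for kw in keywords:
            index[kw].add(doc_id)
    return index

def match_any(index, keywords):
    """OR semantics (like &&): union of the posting sets."""
    return set().union(*(index.get(k, set()) for k in keywords))

def match_all(index, keywords):
    """AND semantics (like @>): intersection of the posting sets.
    For simplicity, an empty keyword list returns no documents."""
    postings = [index.get(k, set()) for k in keywords]
    return set.intersection(*postings) if postings else set()

index = build_inverted_index(doc_keywords)
print(match_any(index, ["Dog", "Cat"]))
print(match_all(index, ["Dog", "Cat"]))
```

A weighted variant would store `(doc_id, weight)` pairs in each posting set so that the inner product can be accumulated while intersecting.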

My final question is whether this weighted-keywords search can be achieved in pgvector, perhaps with new sparse-vector operators analogous to && and @>.

Some other vector databases, like Milvus, already include Inverted Indexes for dealing with sparse vectors:

[Image: Milvus Sparse Vector Index Example]
