Sparse vectors: 1 representation, 2 use cases

In my opinion, sparse vectors can solve two different problems:

1. Text search (TF-IDF, BM25, SPLADE, etc.)
2. Weighted-keywords search

In both cases, we can have a sparse vector representation:

1. Text: `{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}`
2. Weighted-keywords: `{"Dog": 0.4, "Cat": 0.3, "Giant panda": 0.1, "Komodo dragon": 0.05}`

However, the use cases are different:

### Text Search

For text search, sparse vectors produced by schemes such as BM25 are a good representation. For example, take the query:

`BM25 vs SPLADE`

It is split into the following tokens:

`["BM", "25", "vs", "SPLADE"]`

It is acceptable to return documents that **contain only a subset** of these tokens, such as a document with this sparse vector:

`{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}`
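A minimal sketch of this "a subset is fine" behaviour, assuming sparse vectors are stored as plain Python dicts. The `sparse_dot` helper and the query weights of 1.0 are illustrative assumptions, not the output of a real BM25 or SPLADE model:

```python
def sparse_dot(a: dict, b: dict) -> float:
    """Inner product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so the cost depends on the shorter vector.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[t] for t, w in a.items() if t in b)

# Hypothetical query weights (1.0 each) for the tokens of "BM25 vs SPLADE".
query = {"BM": 1.0, "25": 1.0, "vs": 1.0, "SPLADE": 1.0}

# The document vector quoted above.
doc = {"What": 0.10430284, "is": 0.10090457, "BM": 0.2635918,
       "25": 0.3382988, "?": 0.052101523}

# Only the shared tokens contribute; the missing ones simply add nothing.
shared = sorted(set(query) & set(doc))
score = sparse_dot(query, doc)
print(shared, score)
```

The document matches on `BM` and `25` only, yet still receives a positive score, which is the OR-like behaviour text search usually wants.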

### Weighted-Keywords Search

Now imagine that I have a collection of documents (images or texts), each annotated with animal keywords and their corresponding probabilities:

- `DOC A: {"Dog": 0.43}`
- `DOC B: {"Cat": 0.21}`
- `DOC C: {"Dog": 0.65, "Cat": 0.11}`
- `DOC D: {"Giant panda": 0.1}`
- `DOC E: {"Dog": 0.33, "Cat": 0.66}`

If I perform a query with the keyword `["Dog"]`, I want the following results (ranked by inner product, highest first):

1. `DOC C: {"Dog": 0.65, "Cat": 0.11}`
2. `DOC A: {"Dog": 0.43}`
3. `DOC E: {"Dog": 0.33, "Cat": 0.66}`

However, if I query with the keywords `["Dog", "Cat"]`, it is **not acceptable** to return documents that contain only a subset of these keywords: I want only documents containing all of them (similar to the PostgreSQL `@>` operator). The matching documents should then be ranked by inner product:

1. `DOC E: {"Dog": 0.33, "Cat": 0.66}`
2. `DOC C: {"Dog": 0.65, "Cat": 0.11}`
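The desired behaviour can be sketched in plain Python as a brute-force scan. This is only an illustration of the semantics, not a proposal for how `pgvector` would implement it; the function name `weighted_keyword_search` is hypothetical, and each query keyword is assumed to have weight 1.0:

```python
docs = {
    "DOC A": {"Dog": 0.43},
    "DOC B": {"Cat": 0.21},
    "DOC C": {"Dog": 0.65, "Cat": 0.11},
    "DOC D": {"Giant panda": 0.1},
    "DOC E": {"Dog": 0.33, "Cat": 0.66},
}

def weighted_keyword_search(keywords, docs):
    """Keep documents containing ALL keywords, ranked by inner product (descending)."""
    results = []
    for name, vec in docs.items():
        # AND filter: skip documents missing any query keyword.
        if all(k in vec for k in keywords):
            # Inner product with an assumed query weight of 1.0 per keyword.
            score = sum(vec[k] for k in keywords)
            results.append((name, score))
    return sorted(results, key=lambda r: r[1], reverse=True)

print(weighted_keyword_search(["Dog"], docs))
print(weighted_keyword_search(["Dog", "Cat"], docs))
```

Under the unit-weight assumption, the query `["Dog", "Cat"]` scores DOC E at about 0.99 and DOC C at about 0.76; non-uniform query weights could change that relative order, but DOC A and DOC B are excluded either way.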

## Final Notes

Plain (unweighted) keyword search in PostgreSQL can be achieved with the array operators `&&` (overlap) and `@>` (contains):

```sql
WHERE query_keywords && ARRAY['keyword1', 'keyword2', 'keyword3'];  -- OR: at least one keyword matches
WHERE query_keywords @> ARRAY['keyword1', 'keyword2', 'keyword3'];  -- AND: all keywords must be present
```

This type of query can also be accelerated with an inverted index in PostgreSQL: [GIN](https://www.postgresql.org/docs/17/gin.html).
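To illustrate why an inverted index suits both operators, here is a rough Python sketch. It is not how GIN is implemented internally; it only shows the idea of posting lists (keyword to set of documents), with OR as a union and AND as an intersection. The helper names `match_any` and `match_all` are hypothetical:

```python
from collections import defaultdict

docs = {
    "DOC A": {"Dog": 0.43},
    "DOC B": {"Cat": 0.21},
    "DOC C": {"Dog": 0.65, "Cat": 0.11},
    "DOC D": {"Giant panda": 0.1},
    "DOC E": {"Dog": 0.33, "Cat": 0.66},
}

# Build the inverted index: keyword -> set of document names containing it.
index = defaultdict(set)
for name, vec in docs.items():
    for keyword in vec:
        index[keyword].add(name)

def match_any(keywords):
    """OR semantics, comparable to the `&&` (overlap) operator."""
    return set().union(*(index.get(k, set()) for k in keywords))

def match_all(keywords):
    """AND semantics, comparable to the `@>` (contains) operator.

    An empty keyword list returns no documents here, a choice made for this sketch.
    """
    postings = [index.get(k, set()) for k in keywords]
    return set.intersection(*postings) if postings else set()

print(sorted(match_any(["Dog", "Cat"])))
print(sorted(match_all(["Dog", "Cat"])))
```

Only the posting lists of the query keywords are touched, so documents such as DOC D are never visited for a `Dog`/`Cat` query. The surviving candidates could then be ranked by inner product.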

My final question is whether this weighted-keywords search can be achieved in `pgvector`, perhaps with new operators analogous to `&&` and `@>`.

Some other vector databases, like Milvus, already include Inverted Indexes for dealing with sparse vectors:

![Milvus Sparse Vector Index Example](https://github.com/user-attachments/assets/20adda21-1c19-403b-b6d2-7ce63ffe6f97)

[Source](https://milvus.io/docs/sparse_vector.md#Index-the-collection)

