
Conversation

@sergioburdisso

As it stands, the scikit-learn backend's embed() method returns a scipy.sparse._csr.csr_matrix object, which cannot be used directly with np.average() when computing topic embeddings. This is because np.average() attempts to cast the input using np.asanyarray(), resulting in an array with an empty shape.

This fix ensures that the output is always a dense NumPy array by calling the .toarray() method on the csr_matrix object before passing it to np.average().

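A minimal reproduction of the failure and the proposed fix (the matrix contents below are illustrative, not real embeddings):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for the sparse embeddings returned by the scikit-learn backend.
embeddings = csr_matrix(np.eye(3))
weights = [1.0, 2.0, 3.0]

# np.average(embeddings, axis=0, weights=weights) raises
# "IndexError: tuple index out of range": np.asanyarray() wraps the
# sparse matrix in a 0-d object array, so a.shape[axis] does not exist.

# The fix in this PR: densify before averaging.
topic_embedding = np.average(embeddings.toarray(), axis=0, weights=weights)
```
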
@MaartenGr
Owner

I am a bit hesitant to return a dense array, considering that it may explode the memory requirements when the initial array was sparse.

Could you also share the error that this PR is supposed to fix? That helps me understand how we can look for alternative fixes.

@sergioburdisso
Author

> I am a bit hesitant to return a dense array, considering that it may explode the memory requirements when the initial array was sparse.

I understand the concern. However, and please correct me if I’m misunderstanding the intended design, the sparse matrix in this context represents the topic-keyword matrix built by sklearn. Since this matrix is not based on the full vocabulary but only on the selected topic keywords, its dimensionality should remain relatively small (i.e., the union of all topic keyword sets). Under that assumption, converting this specific matrix to dense should not cause a memory issue, as the vocabulary for topic representations is intentionally compact.

> Could you also share the error that this PR is supposed to fix? That helps me understand how we can look for alternative fixes.

Sure, here is the relevant traceback:

```
Traceback (most recent call last):
  File "__init__.py", line 228, in get_model
    topic_model.fit(documents=X, y=y)
  File ".local/lib/python3.10/site-packages/bertopic/_bertopic.py", line 387, in fit
    self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
  File ".local/lib/python3.10/site-packages/bertopic/_bertopic.py", line 519, in fit_transform
    documents = self._reduce_topics(documents)
  File ".local/lib/python3.10/site-packages/bertopic/_bertopic.py", line 4430, in _reduce_topics
    self._extract_topics(documents, verbose=self.verbose)
  File ".local/lib/python3.10/site-packages/bertopic/_bertopic.py", line 4056, in _extract_topics
    self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
  File ".local/lib/python3.10/site-packages/bertopic/_bertopic.py", line 4233, in _create_topic_vectors
    topic_embedding = np.average(
  File ".local/lib/python3.10/site-packages/numpy/lib/function_base.py", line 540, in average
    if wgt.shape[0] != a.shape[axis]:
IndexError: tuple index out of range
```

The root issue is that within self._create_topic_vectors(), NumPy’s np.average() receives topic embeddings that are of type scipy.sparse.csr_matrix. NumPy internally calls np.asanyarray() on these inputs, but for a CSR matrix this returns an array with an empty shape rather than a dense representation. This leads to a shape mismatch in the check wgt.shape[0] != a.shape[axis], which then triggers the IndexError.

Since CSR matrices are not compatible with NumPy’s reduction functions, converting them to dense with .toarray() resolves the issue, and this is precisely the conversion that CSR matrices are meant to support when dense operations are needed.

@MaartenGr
Owner

> Since this matrix is not based on the full vocabulary but only on the selected topic keywords, its dimensionality should remain relatively small (i.e., the union of all topic keyword sets).

Is it though? The scikit-learn backend returns a Bag-of-Words embedding for each document based on the full vocabulary of documents (assuming that a Bag-of-Words-like technique is used). So that would mean that it potentially is a very large sparse matrix that gets converted to a dense format.

@sergioburdisso
Author

As I mentioned, "please correct me if I’m misunderstanding the intended design": I'm not sure at this point whether it uses the full vocabulary or only the terms associated with the topic. Since the matrix is used to represent each topic as a single embedding, it would make sense for it to be built only from the keyword embeddings (not the whole vocab), but I'm not sure how you actually implemented it.

Anyways, full vocabulary or not, the topic embedding is computed as the row-wise average of this matrix, and since the matrix is returned as sparse the computation fails: np.average expects a regular (dense) array...

@MaartenGr
Owner

> As I mentioned, "please correct me if I’m misunderstanding the intended design": I'm not sure at this point whether it uses the full vocabulary or only the terms associated with the topic. Since the matrix is used to represent each topic as a single embedding, it would make sense for it to be built only from the keyword embeddings (not the whole vocab), but I'm not sure how you actually implemented it.

If I'm not mistaken, the scikit-learn embedder will take all input documents and use the vocabulary across all of these documents to create the embeddings. As such, you typically have a very large sparse matrix when you create these embeddings.

It cannot be the case that only the terms associated with the topic are stored, because that would mean that the embeddings of two topics would have different dimensions, which is not the case.
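
As a quick illustration of this point (hypothetical documents, not BERTopic code):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the tesla engine", "golf and other sports"]  # hypothetical input documents
X = CountVectorizer().fit_transform(docs)

# Every embedding spans the full vocabulary of *all* documents, so all rows
# share the same (typically very large, very sparse) dimensionality.
print(X.shape)  # (2, 7): one column per term in the combined vocabulary
```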

> Anyways, full vocabulary or not, the topic embedding is computed as the row-wise average of this matrix, and since the matrix is returned as sparse the computation fails: np.average expects a regular (dense) array...

Well... it does matter quite a bit. The fact that np.average expects a regular (dense) array doesn't mean we should convert the sparse matrix to dense. The reason is that the entire vocabulary is used (there is no way around that with Bag-of-Words-like techniques), and casting such a sparse matrix to dense might result in Out-of-Memory errors.

Therefore, I would prefer a method that allows for averaging sparse matrices directly, perhaps by doing the averaging directly on the sparse matrices rather than using np.average.
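
For reference, a row-wise weighted average can be computed directly on a csr_matrix via a sparse-dense product, without ever densifying it (a minimal sketch, not BERTopic's actual code):

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_weighted_average(X: csr_matrix, weights) -> np.ndarray:
    """Weighted average over the rows of X, keeping X sparse throughout."""
    w = np.asarray(weights, dtype=float)
    # X.T @ w is a sparse-dense matrix-vector product; only the result,
    # a single dense vector of length X.shape[1], is materialized.
    return (X.T @ w) / w.sum()
```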

@sergioburdisso
Author

sergioburdisso commented Dec 3, 2025

> If I'm not mistaken, the scikit-learn embedder will take all input documents and use the vocabulary across all of these documents to create the embeddings. As such, you typically have a very large sparse matrix when you create these embeddings.

You're right that using the full vocabulary across all documents leads to a very large sparse matrix.

However, for representing a topic, it might make more sense to build the embedding from just that topic’s keywords rather than from the full vocabulary. In the sparse-vector setting, this is equivalent to only considering the dimensions associated with those keywords instead of the full vocabulary (that is, when creating the sklearn representation, pass the keywords as the vocabulary, e.g. TfidfVectorizer(..., vocabulary=keywords), so the sparse matrix is built from all the documents but only over the keyword dimensions).

This way, the topic embedding becomes more informative, since it won't be diluted by less relevant or common words that appear across many topics. If you average over all terms in all documents, the resulting vector is influenced by a lot of generic words, which can blur the actual topic signal.
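
Sketched out, the idea looks roughly like this (the documents and keyword union are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the tesla engine", "golf and other sports"]  # hypothetical documents
keywords = ["tesla", "engine", "golf", "sports"]  # union of all topic keywords

# Restricting the vocabulary keeps every embedding at len(keywords) dimensions.
X = TfidfVectorizer(vocabulary=keywords).fit_transform(docs)
print(X.shape)  # (2, 4)
```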

> Therefore, I would prefer a method that allows for averaging sparse matrices directly, perhaps by doing the averaging directly on the sparse matrices rather than using np.average.

I agree that, when working with the full vocabulary, forcing the sparse matrix to become dense is not ideal and can easily lead to OOM issues. In that case, it’s definitely better to change the averaging logic:

  • if the data is sparse, do a sparse-aware average directly on the sparse matrix;
  • otherwise, fall back to np.average (or the usual dense operations), as sketched below.

However, if the sparse matrix only covers the relevant keywords, converting to dense should be fine too.
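
A sparse-aware dispatch along those lines could look like this (a sketch only; the function name is illustrative, not BERTopic's actual code):

```python
import numpy as np
from scipy.sparse import issparse

def average_embeddings(X, weights):
    """Weighted row average that handles both sparse and dense inputs."""
    w = np.asarray(weights, dtype=float)
    if issparse(X):
        # Sparse path: a sparse-dense product avoids densifying X.
        return (X.T @ w) / w.sum()
    # Dense path: keep the existing np.average behavior.
    return np.average(X, axis=0, weights=w)
```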

@MaartenGr
Owner

> However, for representing a topic, it might make more sense to build the embedding from just that topic’s keywords rather than from the full vocabulary. In the sparse-vector setting, this is equivalent to only considering the dimensions associated with those keywords instead of the full vocabulary (that is, when creating the sklearn representation, pass the keywords as the vocabulary, e.g. TfidfVectorizer(..., vocabulary=keywords), so the sparse matrix is built from all the documents but only over the keyword dimensions).

Just making sure I understand correctly here. If you would rebuild the topic embeddings based only on the keywords per topic by doing TfidfVectorizer(..., vocabulary=keywords), won't you get differently sized embeddings per topic when some topics might have more non-zero values for their bag-of-words than others? Also, even if they were to have the same size, then they couldn't be compared at all since every single value points to something else.

For instance, you would get something like this:

```python
tfidf_embedding_topic_a = [2, 1, 1, 4]
keywords_topic_a = ["car", "tesla", "volvo", "engine"]

tfidf_embedding_topic_b = [1, 2, 1, 3]
keywords_topic_b = ["sports", "golf", "tennis", "football"]
```

In other words, you get mismatched vocabularies; for two topic embeddings (assuming they are Bag-of-Words embeddings) to be comparable, they need to share the same vocabulary.

> I agree that, when working with the full vocabulary, forcing the sparse matrix to become dense is not ideal and can easily lead to OOM issues. In that case, it’s definitely better to change the averaging logic:

Yes, that was indeed my suggestion. Seems like it should be rather straightforward to implement.

@sergioburdisso
Author

sergioburdisso commented Dec 3, 2025

> In other words, you get mismatched vocabularies; for two topic embeddings (assuming they are Bag-of-Words embeddings) to be comparable, they need to share the same vocabulary.

Sorry, my bad, I meant TfidfVectorizer(..., vocabulary=union_of_all_keywords), not only the target topic's keywords, as I posted in the original message:

> dimensionality should remain relatively small (i.e., the union of all topic keyword sets).

That is, vocabulary = the union of the keywords of all topics (I said that the first time precisely because of what you describe, each topic having a different dimensionality and not being comparable).

@MaartenGr
Owner

Although the dimensionality issue would be solved, the information the embeddings contain wouldn't. By stripping out all information except for the keywords, you are left with only an extremely tiny portion of the information you had before. I think this would hurt the predictive power of the embeddings, which are used for the .transform step and therefore should have the same dimensionality as the embeddings generated for the input documents. Moreover, recalculating the c-TF-IDF representations is not trivial, especially when you do not have access to the input documents (which is often the case when recalculating the topic embeddings based on the input TF-IDF representations).

All in all, although I agree with the sparse-aware averaging, there is just too much information (and functionality!) lost when attempting to reduce the dimensionality (or information structure) of the topic embeddings when compared to the input embeddings.
