Make SklearnEmbedder return a dense array
#2454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Conversation
As is, the `embed()` method returns a `scipy.sparse._csr.csr_matrix` object, which can't be used directly with `np.average()`, the function used to compute the topic embeddings. This is because `np.average()` tries to cast the passed object using `np.asanyarray()`, which returns it as an array with an empty shape. This fix simply uses the `.toarray()` method of `scipy.sparse._csr.csr_matrix` so that a dense NumPy array is always returned.
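The failure mode and the fix can be reproduced in isolation (a small sketch; the matrix values are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A toy sparse "embedding" matrix, standing in for the embedder's output.
X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 3.0, 0.0]]))

# np.average() internally calls np.asanyarray(), which wraps the sparse
# matrix in a 0-d object array, so indexing a.shape[axis] later fails
# with "IndexError: tuple index out of range".
print(np.asanyarray(X).shape)  # ()

# Densifying first gives np.average() a regular ndarray to work with.
avg = np.average(X.toarray(), axis=0, weights=[1.0, 1.0])
print(avg)  # [0.5 1.5 1. ]
```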
I am a bit hesitant to return a dense array, considering that it may explode the memory requirements when the initial array was sparse. Could you also share the error that this PR is supposed to fix? That helps me understand how we can look for alternative fixes.
I understand the concern. However, and please correct me if I’m misunderstanding the intended design, the sparse matrix in this context represents the topic-keyword matrix built by sklearn. Since this matrix is not based on the full vocabulary but only on the selected topic keywords, its dimensionality should remain relatively small (i.e., the union of all topic keyword sets). Under that assumption, converting this specific matrix to dense should not cause a memory issue, as the vocabulary for topic representations is intentionally compact.
Sure, here is the relevant traceback:

```
Traceback (most recent call last):
  File "__init__.py", line 228, in get_model
    topic_model.fit(documents=X, y=y)
  File ".local/lib/python3.10/site-packages/bertopic/_bertopic.py", line 387, in fit
    self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
  File ".local/lib/python3.10/site-packages/bertopic/_bertopic.py", line 519, in fit_transform
    documents = self._reduce_topics(documents)
  File ".local/lib/python3.10/site-packages/bertopic/_bertopic.py", line 4430, in _reduce_topics
    self._extract_topics(documents, verbose=self.verbose)
  File ".local/lib/python3.10/site-packages/bertopic/_bertopic.py", line 4056, in _extract_topics
    self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
  File ".local/lib/python3.10/site-packages/bertopic/_bertopic.py", line 4233, in _create_topic_vectors
    topic_embedding = np.average(
  File ".local/lib/python3.10/site-packages/numpy/lib/function_base.py", line 540, in average
    if wgt.shape[0] != a.shape[axis]:
IndexError: tuple index out of range
```

The root issue is that within `_create_topic_vectors` the embeddings reach `np.average()` as a sparse matrix. Since CSR matrices are not compatible with NumPy's reduction functions, converting them to dense with `.toarray()` resolves the error.
Is it though? The scikit-learn backend returns a Bag-of-Words embedding for each document based on the full vocabulary of documents (assuming that a Bag-of-Words-like technique is used). So that would mean that it potentially is a very large sparse matrix that gets converted to a dense format.
As I mentioned, "please correct me if I’m misunderstanding the intended design": I’m not sure at this point whether it uses the full vocabulary or only the terms associated with the topics. Since the matrix is used to represent each topic as a single embedding, it would make sense for it to be based only on the keyword embeddings (not the whole vocabulary), but I don’t know how you actually implemented it. Anyway, full vocabulary or not, the topic embedding is computed as the row-wise average of this matrix, and since it is returned as sparse, the computation fails.
If I'm not mistaken, the scikit-learn embedder will take all input documents and use the vocabulary across all of these documents to create the embeddings. As such, you typically have a very large sparse matrix when you create these embeddings. It cannot be the case that only the terms associated to the topic are stored because that would mean that the embeddings of two topics will have different dimensions, which is not the case.
Well... it does matter quite a bit. Even if `np.average` expects a regular/dense matrix, that doesn't mean we should convert the sparse matrix to dense. The reason is that the entire vocabulary is used (there is no way around that with Bag-of-Words-like techniques), and casting a sparse matrix to dense might result in Out-of-Memory errors. Therefore, I would prefer a method that allows for averaging sparse matrices directly, perhaps by doing the averaging on the sparse matrices themselves rather than using `np.average`.
You're right that using the full vocabulary across all documents leads to a very large sparse matrix. However, for representing a topic, it might make more sense to build the embedding from just that topic’s keywords rather than from the full vocabulary. In the sparse-vector setting, this is equivalent to only considering the dimensions associated with those keywords instead of the full vocabulary (that is, when creating the sklearn representation, pass the topic keywords as the vocabulary). This way, the topic embedding becomes more informative, since it won't be diluted by less relevant or common words that appear across many topics. If you average over all terms in all documents, the resulting vector is influenced by a lot of generic words, which can blur the actual topic signal.
I agree that, when working with the full vocabulary, forcing the sparse matrix to become dense is not ideal and can easily lead to OOM issues. In that case, it’s definitely better to change the averaging logic.
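For example, the weighted average can be computed directly on the CSR matrix without densifying it (a sketch with illustrative names, not the actual BERTopic implementation):

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_weighted_average(X, weights):
    """Weighted average of the rows of a sparse matrix.

    Only the final (n_features,) result is dense; the full
    (n_samples, n_features) matrix is never materialized.
    """
    w = np.asarray(weights, dtype=float)
    # Sparse-dense product: X.T @ w sums the rows weighted by w,
    # returning a dense 1-D array of length n_features.
    return X.T.dot(w) / w.sum()

X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 4.0, 0.0]]))
avg = sparse_weighted_average(X, [3.0, 1.0])
print(avg)  # weighted column averages: [0.75, 1.0, 1.5]
```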
Just making sure I understand correctly here. If you would rebuild the topic embeddings based only on the keywords per topic, you would get something like this:

```python
tfidf_embedding_topic_a = [2, 1, 1, 4]
keywords_topic_a = ["car", "tesla", "volvo", "engine"]

tfidf_embedding_topic_b = [1, 2, 1, 3]
keywords_topic_b = ["sports", "golf", "tennis", "football"]
```

In other words, you get mismatched vocabularies, and for two topic embeddings (assuming they are Bag-of-Words embeddings) to be comparable, they need to have the same vocabulary.
Yes, that was indeed my suggestion. Seems like it should be rather straightforward to implement.
Sorry, my bad, I meant:
That is, vocabulary = union of the keywords of all topics (I said that the first time, precisely because of what you describe about each topic having a different dimension / not being comparable).
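The union-of-keywords idea could be sketched like this (hypothetical code; `CountVectorizer`'s `vocabulary` parameter is real scikit-learn API, but the variable names and data are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

keywords_topic_a = ["car", "tesla", "volvo", "engine"]
keywords_topic_b = ["sports", "golf", "tennis", "football"]

# Shared vocabulary = union of the keywords of all topics, so every
# topic embedding lives in the same (compact) space and is comparable.
shared_vocab = sorted(set(keywords_topic_a) | set(keywords_topic_b))
vectorizer = CountVectorizer(vocabulary=shared_vocab)

# Toy "topic documents" built from each topic's keywords.
topic_docs = ["car engine tesla volvo car", "golf tennis football sports"]
embeddings = vectorizer.transform(topic_docs)

print(embeddings.shape)  # same dimensionality for both topics
```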
Although the dimensionality issue would be solved, the information loss wouldn't. By stripping out all information except for the keywords, you are left with only a tiny fraction of the information you had before. I think this would hurt the predictive power of the embeddings, which are used in other parts of the pipeline as well. All in all, although I agree with the sparse-aware averaging, there is just too much information (and functionality!) lost when attempting to reduce the dimensionality (or information structure) of the topic embeddings compared to the input embeddings.
As it stands, the `SklearnEmbedder.embed()` method returns a `scipy.sparse._csr.csr_matrix` object, which cannot be used directly with `np.average()` when computing topic embeddings. This is because `np.average()` attempts to cast the input using `np.asanyarray()`, resulting in an array with an empty shape.

This fix ensures that the output is always a dense NumPy array by calling the `.toarray()` method on the `csr_matrix` object before passing it to `np.average()`.