-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements to SplitMatrix #91
Conversation
- Allow SplitMatrix to be constructed from another SplitMatrix. - Allow inputs of SplitMatrix to be 1-d - Implement __getitem__ for column subset - Also had to implement column subsetting for CategoricalMatrix - __repr__ uses the __repr__ function of components instead of str()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
colmap[idx] = [i, j] | ||
return colmap | ||
|
||
def _split_col_subsets_unordered(self, cols): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function seems occasionally to return empty lists when one indexes columns with a list, which causes is_sorted
to throw an error later.
MRE:
import pandas as pd
import quantcore.matrix as mx
df = pd.DataFrame({"u": ["a", "b"], "v": ["a", "b"]})
X = mx.from_pandas(df, cat_threshold=False, object_as_cat=True)
X[:, [0, 1]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't replicate this on macOS. I assume you tried this on Windows. Is that correct? Can you give me some info on your setup?
Also, you said "occasionally". Does it always throw an error when you try your MRE or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it always throw an error when you try your MRE or not?
Yes. But it doesn't throw an error if we reduce the number of rows from two to one, for example.
I assume you tried this on Windows. Is that correct?
Yes. I'm not embarrassed.
Can you give me some info on your setup?
python : 3.8.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 96 Stepping 1, AuthenticAMD
byteorder : little
pandas : 1.3.0
numpy : 1.20.3
Is this helpful? Do you need something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we perhaps add unit tests for indexing? The Windows CI could come in handy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, unit testing will be very valuable. I'll temporarily add the Windows CI to this PR for all the pushes. And yes, this is helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 on indexing unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is huge!! Thanks Marc for making a really big improvement to SplitMatrix. Let me know if there are any pieces that you would like to hand off. It feels like you started what seemed like a small project and it's creeped outwards a couple times now. So, don't feel like you're committed and stuck finishing this if you have other important stuff going on.
return CategoricalMatrix(self.cat[row]) | ||
else: | ||
# return a SparseMatrix if we subset columns | ||
return SparseMatrix(self.tocsr()[row, col], dtype=self.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite inefficient because we construct the full sparse matrix with self.tocsr()
before subsetting it. I'm fine leaving it like this for now, but I think it'd be good to at least leave a TODO comment or add an issue mentioning this performance bug. Fixing this for the single element case is quite easy. For the [:, cols]
case, I guess we need to construct a sparse matrix element by element and I'm guessing that it'll be easiest to do that in Cython.
colmap[idx] = [i, j] | ||
return colmap | ||
|
||
def _split_col_subsets_unordered(self, cols): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 on indexing unit tests.
added docstring Co-authored-by: Ben Thompson <[email protected]>
This is a big commit with many changes: - partial support of integer indexing - removed advanced indexing for densematrix and sparsematrix - ensure that indexing of splitmatrix generates basematrix type - partial fix of standardizedmatrix indexing - added indexing tests (currently fails)
Quick thought: it looks like several loosely related updates within one PR. |
Great idea. There are finished features that I would like to merge soon so I'll create more focused PRs with them. |
Yes, being able to push some of the changes independently of others is one advantage. |
closing. Most changes were implemented in other PRs and we will clearly take another approach to dealing with this. |
__getitem__
for column subset__repr__
uses the__repr__
method of components instead of str()ToDo:
_split_col_subsets
(first confirm that it's a bug)Checklist
CHANGELOG.rst
entry