Create CatMatrix from codes and categories #389
@lbittarello, if you have time I would love to get your feedback. It removes the |
Looks good. I feel that a separate constructor, e.g. `CategoricalMatrix.from_codes` (like `pandas.Categorical.from_codes`), might be a bit clearer than the meaning of `cat_vec` being dependent on another argument, but I'm on board with this solution, too.
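For reference, the pandas constructor mentioned above works as sketched below. This only illustrates `pandas.Categorical.from_codes`; the sample codes and categories are made up, and nothing here shows how a `CategoricalMatrix.from_codes` would actually be implemented.

```python
import numpy as np
import pandas as pd

# Integer codes index into the list of categories.
codes = np.array([0, 2, 1, 0])
categories = ["a", "b", "c"]

# Build the categorical directly from codes, without materialising the values first.
cat = pd.Categorical.from_codes(codes, categories=categories)

values = list(cat)                 # decoded values, in row order
round_trip = cat.codes.tolist()    # the codes are stored unchanged
```

The appeal of a dedicated constructor is that the argument's meaning (codes rather than values) is clear from the call site alone.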
I'm happy with the option to create a categorical matrix from codes. Indeed, it was my preferred option. :)
However, this PR has an unintended consequence. If we use a Polars array to instantiate a categorical matrix, the `cat` property returns a Polars array. If we then subset this categorical matrix, the subset will no longer remember that the original input was a Polars array, because the `_input_dtype` attribute will be set to the `dtype` of the array of categorical codes. At that point, `cat` will return a Pandas series.
The `_Categorical` container ensures that, no matter how many times we subset a categorical matrix, we'll always remember the input type and `cat` will always return an appropriate series.
I'm not sure that we care about it. I added a deprecation warning to `cat` so that we'd be able to take categorical codes in the future without these complications.
Alternatively, we could add an `input_dtype` instantiation parameter for `__getitem__` to set, but it seemed a strange attribute for users to have access to.
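A minimal sketch of that alternative, under loud assumptions: `ToyCatMatrix` and its `input_kind` parameter are hypothetical stand-ins invented for illustration, not the library's classes or attributes. The sketch only shows how a constructor parameter set by `__getitem__` would let a subset remember the original input type.

```python
import numpy as np


class ToyCatMatrix:
    """Hypothetical toy container; not the real categorical matrix class."""

    def __init__(self, codes, categories, input_kind="pandas"):
        self.codes = np.asarray(codes)
        self.categories = list(categories)
        # Remember what kind of object the user originally passed in.
        self._input_kind = input_kind

    def __getitem__(self, rows):
        # Forward the remembered input kind instead of re-deriving it
        # from the codes array, so repeated subsetting does not lose it.
        return ToyCatMatrix(
            self.codes[rows], self.categories, input_kind=self._input_kind
        )


m = ToyCatMatrix([0, 2, 1, 0], ["a", "b", "c"], input_kind="polars")
sub = m[[0, 1]][[1]]  # subset twice
```

The drawback raised above still applies: the parameter would be visible to users even though only `__getitem__` needs it.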
If we deprecate the |
Of course, I can add a more obvious error message instead of relying on python spitting out an error because it doesn't know what |
Would adding an |
That's fine by me. :) |
See this issue on Glum to understand the reasoning.
Checklist
`CHANGELOG.rst` entry