Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create CatMatrix from codes and categories #389

Merged

Conversation

MarcAntoineSchmidtQC
Copy link
Member

See this issue on Glum to understand the reasoning.

Checklist

  • Added a CHANGELOG.rst entry

@MarcAntoineSchmidtQC
Copy link
Member Author

@lbittarello, if you have time I would love to get your feedback. It removes the _Categorical class that you added during the polars PR.

@MarcAntoineSchmidtQC MarcAntoineSchmidtQC marked this pull request as ready for review September 12, 2024 18:32
Copy link
Collaborator

@stanmart stanmart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I feel that a separate constructor, e.g. CategoricalMatrix.from_codes (like pandas.Categorical.from_codes) might be a bit clearer than the meaning of cat_vec being dependent on another argument, but I'm on board with this solution, too.

Copy link
Member

@lbittarello lbittarello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the option to create a categorical matrix from codes. Indeed, it was my preferred option. :)

However, this PR has an unintended consequence. If we use a Polars array to instantiate a categorical matrix, the cat property returns a Polars array. If we then subset this categorical matrix, the subset will no longer remember that the original input was a Polars array, because the _input_dtype attribute will be set to the dtype of the array of categorical codes. At that point, cat will return a Pandas series.

The _Categorical container ensures that, no matter how many times we subset a categorical matrix, we'll always remember the input type and cat will always return an appropriate series.

I'm not sure that we care about it. I added a deprecation warning to cat to that we'd be able to take categorical codes in the future without these complications.

Alternatively, we could add an input_dtype instantiation parameter for __getitem__ to set, but it seemed a strange attribute for users to have access to.

@MarcAntoineSchmidtQC
Copy link
Member Author

If we deprecate the cat property, I don't think we should be thinking about expanding its scope. Polars support is not released yet, so in the stable version cat is a pandas.Categorical. I would be happy to simply return a pandas.Categorical when possible and otherwise raise an error saying that this method is not supported with a non-pandas backend. This is fully backward compatible.

@MarcAntoineSchmidtQC
Copy link
Member Author

Of course, I can add a more obvious error message instead of relying on python spitting out an error because it doesn't know what pd is.

@stanmart
Copy link
Collaborator

Would adding an _input_dtype argument to CategoricalMatrix's constructor solve this? __getitem__ sees self and can pass this property to the newly created matrix.

@lbittarello
Copy link
Member

I would be happy to simply return a pandas.Categorical when possible and otherwise raise an error saying that this method is not supported with a non-pandas backend.

That's fine by me. :)

@MarcAntoineSchmidtQC MarcAntoineSchmidtQC merged commit 8b1ba87 into main Sep 20, 2024
21 of 22 checks passed
@MarcAntoineSchmidtQC MarcAntoineSchmidtQC deleted the Create-CategoricalMatrix-from-codes-and-categories branch September 20, 2024 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants