-
Notifications
You must be signed in to change notification settings - Fork 13
feat: Extend allowed types of the categories
arg of dy.Enum
#138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #138 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 49 49
Lines 2812 2818 +6
=========================================
+ Hits 2812 2818 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
categories
arg of dy.Enum
categories
arg of dy.Enum
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat, looks great, thanks @jjurm! I think it's fine to represent the categories as series, I can't come up with a downside and we mirror the polars API this way. Would love to hear a second opinion on this by @AndreasAlbertQC or @delsner though 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @jjurm! Personally, I'd have gone for list
, but I also don't think it's super important 🤷
values=[getattr(v, "value", v) for v in categories.__members__.values()] | ||
) | ||
elif not isinstance(categories, pl.Series): | ||
categories = pl.Series(values=categories) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My take is that I would not use a polars object when I can just use vanilla python without drawbacks, which seems to be the case here. 🤷 Why not just list
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the reason is to mirror Polars, so we can pass categories around without intermediate conversions, e.g.:
- A call to the
.dtype
property creates thepl.Enum
type which would convert lists to Series on every call. This conversion is skipped when the categories already are apl.Series
. - Storing categories as a Series allows us to defer the category comparison logic to Polars, therefore ensuring a more consistent behavior with Polars in the future (sure, this is more of a cosmetic reason)
@@ -72,7 +81,7 @@ def dtype(self) -> pl.DataType: | |||
def validate_dtype(self, dtype: PolarsDataType) -> bool: | |||
if not isinstance(dtype, pl.Enum): | |||
return False | |||
return self.categories == dtype.categories.to_list() | |||
return self.categories.equals(dtype.categories) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does not need to change in this PR, but while we are talking: I always feel like this should be a set comparison so that enums with the same categories but different order are considered the same. wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enum uses category order for comparisons and sorts, so two Enums with the same labels but different order are not equivalent. Therefore I’d keep it order-sensitive on purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. I tend to think of enums just in terms of "what values are allowed" and never in terms of sorting, but that might be my personal bias :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, Polars Enum is also order-sensitive.
@@ -22,7 +24,7 @@ class Enum(Column): | |||
|
|||
def __init__( | |||
self, | |||
categories: Sequence[str], | |||
categories: pl.Series | Iterable[str] | type[enum.Enum], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO fine to accept Series here, independently of whether we use it internally as well 👍
Motivation
This makes the
categories
argument accept the same types as inpl.Enum
, making it easier to convert from polars dtypes to dataframely dtypes.Fixes #136.
Changes
dy.Enum.categories
is now stored as apl.Series
- happy to think of another solution if you don't think we should convert all sequences to series. However, this is what polars does internally anyway oncedy.Enum.dtype
is called.