Skip to content

Conversation

jjurm
Copy link
Contributor

@jjurm jjurm commented Sep 5, 2025

Motivation

This makes the categories argument accept the same types as in pl.Enum, making it easier to convert from polars dtypes to dataframely dtypes.

Fixes #136.

Changes

dy.Enum.categories is now stored as a pl.Series - happy to think of another solution if you don't think we should convert all sequences to series. However, this is what polars does internally anyway once dy.Enum.dtype is called.

@jjurm jjurm added the enhancement New feature or request label Sep 5, 2025
Copy link

codecov bot commented Sep 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (f303b3b) to head (3c03c14).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #138   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           49        49           
  Lines         2812      2818    +6     
=========================================
+ Hits          2812      2818    +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jjurm jjurm changed the title Extend allowed types of the categories arg of dy.Enum feat: Extend allowed types of the categories arg of dy.Enum Sep 5, 2025
@jjurm jjurm marked this pull request as ready for review September 5, 2025 16:43
Copy link
Member

@borchero borchero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat, looks great, thanks @jjurm! I think it's fine to represent the categories as series, I can't come up with a downside and we mirror the polars API this way. Would love to hear a second opinion on this by @AndreasAlbertQC or @delsner though 😄

Copy link
Collaborator

@AndreasAlbertQC AndreasAlbertQC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @jjurm! Personally, I'd have gone for list, but I also don't think it's super important 🤷

values=[getattr(v, "value", v) for v in categories.__members__.values()]
)
elif not isinstance(categories, pl.Series):
categories = pl.Series(values=categories)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My take is that I would not use a polars object when I can just use vanilla python without drawbacks, which seems to be the case here. 🤷 Why not just list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the reason is to mirror Polars, so we can pass categories around without intermediate conversions, e.g.:

  1. A call to the .dtype property creates the pl.Enum type which would convert lists to Series on every call. This conversion is skipped when the categories already are a pl.Series.
  2. Storing categories as a Series allows us to defer the category comparison logic to Polars, therefore ensuring a more consistent behavior with Polars in the future (sure, this is more of a cosmetic reason)

@@ -72,7 +81,7 @@ def dtype(self) -> pl.DataType:
def validate_dtype(self, dtype: PolarsDataType) -> bool:
if not isinstance(dtype, pl.Enum):
return False
return self.categories == dtype.categories.to_list()
return self.categories.equals(dtype.categories)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not need to change in this PR, but while we are talking: I always feel like this should be a set comparison so that enums with the same categories but different order are considered the same. wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enum uses category order for comparisons and sorts, so two Enums with the same labels but different order are not equivalent. Therefore I’d keep it order-sensitive on purpose.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. I tend to think of enums just in terms of "what values are allowed" and never in terms of sorting, but that might be my personal bias :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, Polars Enum is also order-sensitive.

@@ -22,7 +24,7 @@ class Enum(Column):

def __init__(
self,
categories: Sequence[str],
categories: pl.Series | Iterable[str] | type[enum.Enum],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO fine to accept Series here, independently of whether we use it internally as well 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Enum classes as categories for pl.Enum
3 participants