-
Notifications
You must be signed in to change notification settings - Fork 952
Support polars.Expr.value_counts
in cudf_polars
#19079
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support polars.Expr.value_counts
in cudf_polars
#19079
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Partially broken off from #19075 and #19079. When eventually exporting `to_polars`, we'll need `Column`s to preserve the `DataType` container which contains the Polars datatype that may contain struct fields (xref #16725). This PR only passes along the `DataType` objects and does not materially use them. A possible eventual goal is to make `Column` require a `DataType` object in the constructor (xref https://github.com/rapidsai/cudf/pull/19075/files#r2126917015). A follow up PR will need to address `DataFrame.from_table` where we do not pass along `DataType` objects yet Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Tom Augspurger (https://github.com/TomAugspurger) URL: #19091
…olars/value_counts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mroeschke. I think this is good to go, but a couple questions about the tests. Feel free to ignore them and merge if you think we're already covered.
@@ -235,6 +236,55 @@ def do_evaluate( | |||
order=order, | |||
null_order=null_order, | |||
) | |||
elif self.name == "value_counts": | |||
(sort, parallel, name, normalize) = self.options |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any policy for expression keywords that we don't use? I think just ignoring them (like we do here) is appropriate, but I wanted to confirm that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see some occurrences of _
used which I assume means unused variables, so I can change to use this
total_counts = plc.reduce.reduce( | ||
counts_col, plc.aggregation.sum(), plc.DataType(plc.TypeId.UINT64) | ||
) | ||
counts_col = plc.binaryop.binary_operation( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential edge case (divide by zero?):
In [6]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect()
Out[6]:
shape: (0, 1)
┌───────────┐
│ a │
│ --- │
│ struct[2] │
╞═══════════╡
└───────────┘
Do we do the right thing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup appears we do, good idea to check
In [2]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect(engine="gpu")
Out[2]:
shape: (0, 1)
┌───────────┐
│ a │
│ --- │
│ struct[2] │
╞═══════════╡
└───────────┘
In [3]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect(engine="gpu").dtypes
Out[3]: [Struct({'a': Null, 'proportion': Float64})]
In [4]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect().dtypes
Out[4]: [Struct({'a': Null, 'proportion': Float64})]
elif counts_col.type().id() == plc.TypeId.INT32: | ||
counts_col = plc.unary.cast(counts_col, plc.DataType(plc.TypeId.UINT32)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like #15852 would be helpful here too.
for child in request.value.children | ||
): | ||
raise NotImplementedError( | ||
"value_counts is not supported in groupby" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a mix of tests that assert we don't support some options, and pragma: no cover
for unsupported operations.
Are folks OK with this, or do we want explicit tests for things we don't support? I don't have a strong preference either way. An argument in favor of having a test is that you ensure the if
condition is set appropriately (and this one looks kinda complicated).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or do we want explicit tests for things we don't support?
Yeah I suppose we should have our own tests for these operations we don't support, but it's opaquely covered by running the Polars unit tests.
I can add a test for this specific case in #19193
Will address the follow ups here in #19193 |
/merge |
Description
Towards #16725
Depends on #19091
Checklist