Skip to content

Support polars.Expr.value_counts in cudf_polars #19079

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

mroeschke
Copy link
Contributor

@mroeschke mroeschke commented Jun 3, 2025

Description

Towards #16725

Depends on #19091

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke self-assigned this Jun 3, 2025
@mroeschke mroeschke requested a review from a team as a code owner June 3, 2025 22:15
@mroeschke mroeschke requested review from bdice and Matt711 June 3, 2025 22:15
@mroeschke mroeschke added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cudf-polars Issues specific to cudf-polars labels Jun 3, 2025
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jun 3, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Jun 3, 2025
@mroeschke mroeschke marked this pull request as draft June 6, 2025 18:31
Copy link

copy-pr-bot bot commented Jun 6, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

rapids-bot bot pushed a commit that referenced this pull request Jun 12, 2025
Partially broken off from #19075 and #19079.

When eventually exporting `to_polars`, we'll need `Column`s to preserve the `DataType` container which contains the Polars datatype that may contain struct fields (xref #16725). This PR only passes along the `DataType` objects and does not materially use them.

A possible eventual goal is to make `Column` require a `DataType` object in the constructor (xref https://github.com/rapidsai/cudf/pull/19075/files#r2126917015). A follow up PR will need to address `DataFrame.from_table` where we do not pass along `DataType` objects yet

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Tom Augspurger (https://github.com/TomAugspurger)

URL: #19091
@mroeschke mroeschke marked this pull request as ready for review June 12, 2025 16:49
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mroeschke. I think this is good to go, but a couple questions about the tests. Feel free to ignore them and merge if you think we're already covered.

@@ -235,6 +236,55 @@ def do_evaluate(
order=order,
null_order=null_order,
)
elif self.name == "value_counts":
(sort, parallel, name, normalize) = self.options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any policy for expression keywords that we don't use? I think just ignoring them (like we do here) is appropriate, but I wanted to confirm that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see some occurrences of _ used which I assume means unused variables, so I can change to use this

total_counts = plc.reduce.reduce(
counts_col, plc.aggregation.sum(), plc.DataType(plc.TypeId.UINT64)
)
counts_col = plc.binaryop.binary_operation(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential edge case (divide by zero?):

In [6]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect()
Out[6]: 
shape: (0, 1)
┌───────────┐
│ a         │
│ ---       │
│ struct[2] │
╞═══════════╡
└───────────┘

Do we do the right thing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup appears we do, good idea to check

In [2]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect(engine="gpu")
Out[2]: 
shape: (0, 1)
┌───────────┐
│ a         │
│ ---       │
│ struct[2] │
╞═══════════╡
└───────────┘

In [3]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect(engine="gpu").dtypes
Out[3]: [Struct({'a': Null, 'proportion': Float64})]

In [4]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect().dtypes
Out[4]: [Struct({'a': Null, 'proportion': Float64})]

Comment on lines +275 to +276
elif counts_col.type().id() == plc.TypeId.INT32:
counts_col = plc.unary.cast(counts_col, plc.DataType(plc.TypeId.UINT32))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like #15852 would be helpful here too.

for child in request.value.children
):
raise NotImplementedError(
"value_counts is not supported in groupby"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a mix of tests that assert we don't support some options, and pragma: no cover for unsupported operations.

Are folks OK with this, or do we want explicit tests for things we don't support? I don't have a strong preference either way. An argument in favor of having a test is that you ensure the if condition is set appropriately (and this one looks kinda complicated).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or do we want explicit tests for things we don't support?

Yeah I suppose we should have our own tests for these operations we don't support, but it's opaquely covered by running the Polars unit tests.

I can add a test for this specific case in #19193

@mroeschke
Copy link
Contributor Author

Will address the follow ups here in #19193

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 2fb78b3 into rapidsai:branch-25.08 Jun 18, 2025
91 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Jun 18, 2025
@mroeschke mroeschke deleted the feat/cudf_polars/value_counts branch June 18, 2025 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf-polars Issues specific to cudf-polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants