Support `polars.Expr.value_counts` in `cudf_polars` #19079

mroeschke · 2025-06-03T22:15:49Z

Description

Towards #16725

Depends on #19091

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-06-06T18:31:24Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Partially broken off from #19075 and #19079. When eventually exporting `to_polars`, we'll need `Column`s to preserve the `DataType` container, which holds the Polars datatype that may contain struct fields (xref #16725). This PR only passes along the `DataType` objects and does not materially use them. A possible eventual goal is to make `Column` require a `DataType` object in the constructor (xref https://github.com/rapidsai/cudf/pull/19075/files#r2126917015). A follow-up PR will need to address `DataFrame.from_table`, where we do not pass along `DataType` objects yet.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- Tom Augspurger (https://github.com/TomAugspurger)

URL: #19091

TomAugspurger

Thanks @mroeschke. I think this is good to go, but a couple questions about the tests. Feel free to ignore them and merge if you think we're already covered.

TomAugspurger · 2025-06-18T16:03:45Z

python/cudf_polars/cudf_polars/dsl/expressions/unary.py

@@ -235,6 +236,55 @@ def do_evaluate(
                order=order,
                null_order=null_order,
            )
+        elif self.name == "value_counts":
+            (sort, parallel, name, normalize) = self.options


Do we have any policy for expression keywords that we don't use? I think just ignoring them (like we do here) is appropriate, but I wanted to confirm that.

I see some occurrences of `_` in the code base, which I assume marks unused variables, so I can change this to follow that convention.
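For illustration, the `_` convention mentioned above could look like the sketch below. `FakeUnaryFunction` and `describe_value_counts` are hypothetical stand-ins, not cudf_polars code, and which option goes unused is an assumption here (`parallel` is picked only as an example).

```python
from typing import NamedTuple


class FakeUnaryFunction(NamedTuple):
    """Hypothetical stand-in for an expression node; not the cudf_polars class."""

    name: str
    options: tuple


def describe_value_counts(expr):
    # Same four-element layout as in the diff: (sort, parallel, name, normalize).
    # Assumption: `parallel` is the option that is not used, so it is bound to `_`.
    sort, _, name, normalize = expr.options
    return {"sort": sort, "name": name, "normalize": normalize}


expr = FakeUnaryFunction("value_counts", (True, False, "count", False))
print(describe_value_counts(expr))
```

Binding the ignored slot to `_` keeps the unpacking aligned with the option tuple while signalling to readers and linters that the value is intentionally dropped.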

TomAugspurger · 2025-06-18T16:09:44Z

python/cudf_polars/cudf_polars/dsl/expressions/unary.py

+                total_counts = plc.reduce.reduce(
+                    counts_col, plc.aggregation.sum(), plc.DataType(plc.TypeId.UINT64)
+                )
+                counts_col = plc.binaryop.binary_operation(


Potential edge case (divide by zero?):

In [6]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect()
Out[6]:
shape: (0, 1)
┌───────────┐
│ a         │
│ ---       │
│ struct[2] │
╞═══════════╡
└───────────┘

Do we do the right thing here?

Yup appears we do, good idea to check

In [2]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect(engine="gpu")
Out[2]:
shape: (0, 1)
┌───────────┐
│ a         │
│ ---       │
│ struct[2] │
╞═══════════╡
└───────────┘

In [3]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect(engine="gpu").dtypes
Out[3]: [Struct({'a': Null, 'proportion': Float64})]

In [4]: pl.LazyFrame({"a": []}).select(pl.col('a').value_counts(normalize=True)).collect().dtypes
Out[4]: [Struct({'a': Null, 'proportion': Float64})]
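As a rough pure-Python model of why the empty case is harmless (this is not the pylibcudf code path, and the `value_counts` helper below is hypothetical): normalization divides each per-value count by the total, so with zero rows there are no counts to divide and the zero total is never used as a divisor.

```python
from collections import Counter


def value_counts(values, normalize=False):
    """Hypothetical helper: map each distinct value to its count or proportion."""
    counts = Counter(values)
    if not normalize:
        return dict(counts)
    total = sum(counts.values())
    # With no rows, this comprehension iterates zero times, so the
    # division by `total` (which is 0 here) is never evaluated.
    return {value: count / total for value, count in counts.items()}


print(value_counts(["a", "a", "b", "c"], normalize=True))
print(value_counts([], normalize=True))
```

The GPU implementation divides whole columns rather than looping, so the sessions above remain the actual evidence that the result matches the CPU engine.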

TomAugspurger · 2025-06-18T16:10:58Z

python/cudf_polars/cudf_polars/dsl/expressions/unary.py

+            elif counts_col.type().id() == plc.TypeId.INT32:
+                counts_col = plc.unary.cast(counts_col, plc.DataType(plc.TypeId.UINT32))


Looks like #15852 would be helpful here too.

TomAugspurger · 2025-06-18T16:14:31Z

python/cudf_polars/cudf_polars/dsl/ir.py

+            for child in request.value.children
+        ):
+            raise NotImplementedError(
+                "value_counts is not supported in groupby"


I see a mix of tests that assert we don't support some options, and `pragma: no cover` for unsupported operations.

Are folks OK with this, or do we want explicit tests for things we don't support? I don't have a strong preference either way. An argument in favor of having a test is that you ensure the if condition is set appropriately (and this one looks kinda complicated).

> or do we want explicit tests for things we don't support?

Yeah I suppose we should have our own tests for these operations we don't support, but it's opaquely covered by running the Polars unit tests.

I can add a test for this specific case in #19193
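A minimal sketch of what an explicit test for an unsupported operation might look like. `check_groupby_requests` is a hypothetical stand-in for the real check in `ir.py`, which inspects aggregation request children rather than plain names; only the error message is taken from the diff.

```python
def check_groupby_requests(request_names):
    """Hypothetical validator: reject value_counts inside a group-by."""
    if any(name == "value_counts" for name in request_names):
        raise NotImplementedError("value_counts is not supported in groupby")
    return list(request_names)


def test_value_counts_in_groupby_raises():
    # Exercising the guard directly confirms the condition fires when expected.
    try:
        check_groupby_requests(["sum", "value_counts"])
    except NotImplementedError as exc:
        assert "value_counts" in str(exc)
    else:
        raise AssertionError("expected NotImplementedError")


def test_supported_requests_pass_through():
    assert check_groupby_requests(["sum", "mean"]) == ["sum", "mean"]


test_value_counts_in_groupby_raises()
test_supported_requests_pass_through()
```

In a pytest suite the try/except would normally be written with `pytest.raises(NotImplementedError)`; the stdlib form is used here only to keep the sketch dependency-free.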

mroeschke · 2025-06-18T17:05:37Z

Will address the follow ups here in #19193

mroeschke · 2025-06-18T17:05:43Z

/merge

mroeschke added 6 commits May 22, 2025 15:23

Implement value_counts in cudf_polars

754afc8

Merge remote-tracking branch 'upstream/branch-25.08' into feat/cudf_polars/value_counts

a182648

Get no argument value_counts working

a3a5d28

make sort=True, param over name

42e884f

Param over normalize

d762099

Use the same count_agg

4eac271

mroeschke self-assigned this Jun 3, 2025

mroeschke requested a review from a team as a code owner June 3, 2025 22:15

mroeschke requested review from bdice and Matt711 June 3, 2025 22:15

mroeschke added the improvement, non-breaking, and cudf-polars labels Jun 3, 2025

github-project-automation bot added this to cuDF Python Jun 3, 2025

github-actions bot added the Python label Jun 3, 2025

GPUtester moved this to In Progress in cuDF Python Jun 3, 2025

mroeschke mentioned this pull request Jun 4, 2025

Support passing DataType to Column container in cudf_polars #19091

Merged

3 tasks

mroeschke marked this pull request as draft June 6, 2025 18:31

Merge remote-tracking branch 'upstream/branch-25.08' into feat/cudf_polars/value_counts

e5dbfa4

mroeschke marked this pull request as ready for review June 12, 2025 16:49

mroeschke added 8 commits June 12, 2025 09:50

Use self.dtype.plc

5afb9f7

Merge branch 'branch-25.08' into feat/cudf_polars/value_counts

993a494

Disallow nested types in can_cast

d05ea65

Merge branch 'feat/cudf_polars/value_counts' of https://github.com/mroeschke/cudf into feat/cudf_polars/value_counts

5831f1b

Merge remote-tracking branch 'upstream/branch-25.08' into feat/cudf_polars/value_counts

8473ec1

Merge remote-tracking branch 'upstream/branch-25.08' into feat/cudf_polars/value_counts

006ad4b

Add xfailing tests

d898b04

Raise NotImplementedError for value_counts in groupby

4f59a1b

mroeschke added 2 commits June 16, 2025 20:00

Add pragma no cover

55979fc

Merge remote-tracking branch 'upstream/branch-25.08' into feat/cudf_polars/value_counts

5e31420

TomAugspurger approved these changes Jun 18, 2025

rapids-bot bot merged commit 2fb78b3 into rapidsai:branch-25.08 Jun 18, 2025
91 checks passed

github-project-automation bot moved this from In Progress to Done in cuDF Python Jun 18, 2025

mroeschke deleted the feat/cudf_polars/value_counts branch June 18, 2025 17:06
