[WIP] Track column statistics in cuDF-Polars #19130
Conversation
@@ -49,6 +50,40 @@ def traversal(nodes: Sequence[NodeT]) -> Generator[NodeT, None, None]:
        lifo.append(child)


def post_traversal(nodes: Sequence[NodeT]) -> Generator[NodeT, None, None]:
NOTE: This was copied from wence-:more-tablestat-doodles.
/ok to test
Happy to see this coming together. Gave a quick pass and will look more closely later.
_SOURCE_STATS_CACHE_MAX_ITEMS: int = 10


def _update_source_stats_cache(
I haven't looked closely, but would `functools.lru_cache(maxsize=10)` work? https://docs.python.org/3/library/functools.html#functools.lru_cache
Looked through this some more. I think the tl/dr is that we could maybe force this into using `lru_cache`. It's unclear to me whether we should.

The idea here is to cache `tuple[path, ...]` -> `dict[str, ColumnSourceStats]`, i.e. the column statistics for a given tuple of paths. Note that the paths in the key are for everything in the IR, not just the ones sampled.

Some of the complexity seems to come from potentially hitting this cache with the same `paths`, but a different subset of columns of interest: We might have something like

```python
a = pl.scan_parquet("data.parquet", columns=["a", "b"])
b = pl.scan_parquet("data.parquet", columns=["b", "c"])
```

Note that `b` is in both, and the full table might have many columns. We want to avoid computing the stats for `b` twice, and we want to avoid computing the stats for columns that we'll never use. This means a simple `lru_cache` with an entry per `ir.paths` isn't going to give us what we want.
This seems doable with a `functools.lru_cache` on the `tuple[path, ...]` key by having it return a (mutable) `dict[str, ColumnSourceStats]`. Then any callers asking for the stats of a given set of paths will get a view on the same dict, which they can mutate in place. Something like

```python
@functools.lru_cache
def get_source_stats(paths: tuple[str, ...]) -> dict[str, ColumnSourceStats]:
    return {}

source_stats_cached = get_source_stats(paths)
for column in need_columns - set(source_stats_cached):
    # compute for that column
    source_stats_cached[column] = ColumnSourceStats(...)
```

Maybe this isn't much better. Mutating cached values is dangerous. But it does give us an LRU cache rather than a FIFO cache, along with all the nice things from `lru_cache` like `cache_info()`.
You could indirect things by having a two-level scheme:

```python
@functools.lru_cache
def stats_getter(paths: tuple[str, ...]) -> Callable[[str], ColumnSourceStats]:
    @functools.lru_cache
    def colstats(column: str) -> ColumnSourceStats:
        return ...
    ...

colstats = stats_getter(paths)("column")
```

?
Ah, I was trying to figure out how to get the two-layer thing to work, but couldn't do it in a way that didn't break the `maxsize` of the outer cache. But I see now that that should be equivalent: set a `maxsize` of 10 or whatever on the `stats_getter` and an unlimited size on the `colstats`. I think this should work out nicely.
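For concreteness, a runnable sketch of that two-level scheme. This is illustrative only: `ColumnSourceStats` is stood in for by a plain dict, and `_compute_stats` is a hypothetical placeholder for the real metadata sampling.

```python
import functools
from typing import Callable

def _compute_stats(paths: tuple[str, ...], column: str) -> dict:
    # Hypothetical stand-in for sampling parquet metadata for one column.
    return {"paths": paths, "column": column}

@functools.lru_cache(maxsize=10)  # bounds how many path-tuples we track
def stats_getter(paths: tuple[str, ...]) -> Callable[[str], dict]:
    @functools.lru_cache(maxsize=None)  # unbounded per-column cache
    def colstats(column: str) -> dict:
        return _compute_stats(paths, column)
    return colstats

getter = stats_getter(("data.parquet",))
getter("a")  # computed once; later calls for "a" hit the inner cache
```

A nice property of this shape: evicting a path-tuple from the outer LRU cache drops its inner cache with it, so per-column entries can't outlive their paths entry.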
self.cardinality: dict[IR, int] = {}
self.column_statistics: dict[IR, dict[str, ColumnStats]] = {}
Are the keys of these two dictionaries always the same?
If so, maybe we model this as something like a record or tuple with two elements: one for the cardinality and one for the column statistics? That way they can't get out of sync.
If these keys can differ, then disregard.
https://github.com/rapidsai/cudf/pull/19130/files#diff-fe5f5ae8b1b9a9f2ea2369b2ae9237d407170e4514c4a866ce0631b47b248511R42 is an example where we just update `column_statistics`, not `cardinality`, so these can differ.
Another comment: we should probably be consistent about `stats` vs. `statistics` in our names, unless there's a strong reason not to. I don't have a preference for which.
> is an example where we just update `column_statistics`, not `cardinality`, so these can differ.
Yeah, this is also one place where I'm trying to keep the API (`StatsCollector`, to be specific) similar to Lawrence's prototype/doodle.
> Another comment: we should probably be consistent about stats vs. statistics in our names, unless there's a strong reason not to. I don't have a preference for which.
Yup - it seems like the inconsistency is because I tend to use `stats` and Lawrence tends to use `statistics`. I'm also indifferent, but agree that we should be consistent.
Parameters
----------
table_source
I'm a bit surprised (not in a negative way) to see this modeled as `ColumnSourceStats` having a `TableSourceStats` field. I would naively expect something like

```python
class TableSourceStats:
    paths: tuple[str, ...]
    cardinality: int | None
    column_statistics: dict[str, ColumnSourceStats]
```

And then we have one `TableSourceStats` per source, and each `TableSourceStats` has one `ColumnSourceStats` per column.

But I haven't looked at how this is used yet. Maybe in practice we're only ever looking at stats for a particular column, and going through the table might be difficult / impossible?
This is a good question. I'm leaning pretty hard in the direction of: `TableSourceStats` is unnecessary, and we should just add a `cardinality` attribute to `ColumnSourceStats`.

I originally started with the design you sketched above. However, since we are mostly interested in working with `ColumnStats` at the IR level, we end up needing to do something like `column_stats.table_source.column_statistics[original_column_name]` to get to the column-source statistics. This is both verbose, and opens us up to confusion if the name of the column has changed.

The current design shortens this to `column_stats.source_stats`, and side-steps the possibility of column renaming causing a problem. However, I still don't see a clear reason why the `TableSourceStats` class needs to exist at all.
> and we should just add a cardinality attribute to ColumnSourceStats.

That seems pretty reasonable to me. We will end up with situations where we have the same key from different tables (e.g. primary key in one, foreign key in another), but that'll occur regardless of whether `cardinality` is on `ColumnSourceStats` or `TableSourceStats`, I think.
Went through building the `ColumnSourceStats` in parquet. I haven't gone through the caching yet.
    plc.io.SourceInfo(paths)
)
num_rows_per_file = int(metadata.num_rows() / len(paths))
num_rows_total = num_rows_per_file * file_count
Is this just `num_rows_total = metadata.num_rows()`? It seems like we're doing something close to that:

```
num_rows_total = num_rows_per_file * file_count
               = int(metadata.num_rows() / len(paths)) * len(paths)
```

so they should be the same other than rounding from the `int`.

Ah, not quite, because of the sampling. I'd suggest a comment explaining how we'll have some directly known values from the sampled files and estimates extrapolated from those. And then a convention of pre/post-fixing variable names with the kind of value we're dealing with. I see `num_rows_total` and `total_uncompressed_size`, so maybe align those to both use a prefix.

And maybe it's worth emphasizing that the total row count and total uncompressed size are estimates by including that in the variable names.
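As an illustration of that naming convention (all numbers below are made up), extrapolating exact per-sample values into clearly-labeled estimates might look like:

```python
# Assume 3 of 10 files were sampled; row counts of sampled files are exact.
file_count = 10
sampled_file_row_counts = [1_000, 1_200, 800]  # known from parquet metadata

# "known_" prefix: values read directly from sampled metadata.
known_num_rows_sampled = sum(sampled_file_row_counts)

# "est_" prefix: values extrapolated from the sample to all files.
est_num_rows_per_file = known_num_rows_sampled / len(sampled_file_row_counts)
est_num_rows_total = int(est_num_rows_per_file * file_count)
```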
rowgroup_offsets_per_file = np.insert(
    np.cumsum(num_row_groups_per_file_samples), 0, 0
Should be the same & probably faster:

```python
rowgroup_offsets_per_file = np.cumsum([0] + num_row_groups_per_file_samples)
```
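A quick check that the two spellings agree (assuming `num_row_groups_per_file_samples` is a plain Python list):

```python
import numpy as np

num_row_groups_per_file_samples = [3, 1, 4]

# Original: cumulative sum, then prepend a leading zero.
a = np.insert(np.cumsum(num_row_groups_per_file_samples), 0, 0)

# Suggested: prepend the zero before taking the cumulative sum.
b = np.cumsum([0] + num_row_groups_per_file_samples)

assert (a == b).all()  # both give [0, 3, 4, 8]
```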
column_sizes = {}
for name, uncompressed_sizes in metadata.columnchunk_metadata().items():
    if name in need_columns:
        column_sizes[name] = np.array(
Why do we need an array here, rather than an aggregated value? I see we derive `total_uncompressed_size` by taking a `mean` later. Does something else use the unaggregated values?

Note to self: the length of this array should exactly match the number of files sampled.
# We have un-cached column metadata to process

# Calculate the mean per-file `total_uncompressed_size` for each column
total_uncompressed_size = {
Is `total` correct here, or should this be `avg`? IIUC this represents our estimate of the in-memory size of a given column for any given source file.
for path, num_rgs in zip(
    paths, num_row_groups_per_file_samples, strict=True
):
    for rg_id in range(num_rgs):
        n += 1
        samples[path].append(rg_id)
        if n == num_rg_samples:
            break
    if n == num_rg_samples:
        break
IIUC, this is building up a list of specific row groups to sample. It will be biased to sample row groups from early files.

Maybe it'd be better to build up the full list of `(file, rowgroup_id)` and then slice through that, like we do with the files above? Roughly:

```python
samples = [
    (file, i)
    for file, num_rgs in zip(paths, num_row_groups_per_file_samples, strict=True)
    for i in range(num_rgs)
]
stride = max(1, int(len(samples) / num_rg_samples))
samples = samples[::stride]
```

Maybe not, though. The number of files will be limited by the config option, but the number of row groups per file could be very large (depends on the parquet file), so that list could be pretty large. There's probably a smart way to do this with some math.
unique_fraction_estimates[name] = max(
    min(1.0, row_group_unique_count / row_group_num_rows),
    0.00001,
What's the motivation behind this `max`? I suppose that with very large row groups, this fraction can get arbitrarily close to zero. And that's something we don't want callers having to think about?
Do we lose anything by truncating to 0.00001?
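For reference, the clamping being questioned, pulled out as a tiny standalone helper (the name is hypothetical; the bounds match the snippet above):

```python
def clamped_unique_fraction(unique_count: int, num_rows: int) -> float:
    # Clamp into [0.00001, 1.0] so downstream estimates never see 0
    # (or a fraction above 1 from degenerate inputs).
    return max(min(1.0, unique_count / num_rows), 0.00001)

clamped_unique_fraction(1, 10_000_000)  # tiny true fraction hits the floor
clamped_unique_fraction(50, 100)        # ordinary case passes through
```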
    Unique-count estimate.
unique_fraction
    Unique-fraction estimate.
file_size
This name confused me initially. It's not the size of the source (e.g. Parquet) file on disk. It's the estimated in-memory size of some column, derived by sampling a few source files.
# Leave out unique stats if they were defined by the
# user. This allows us to avoid collecting stats for
# columns that are known to be problematic.
user_fractions = ir.config_options.executor.unique_fraction
Thinking through how I feel about this design. IIUC, the intent is to always give preference to user-provided statistics over stats from the source, which makes sense. `_get_unique_fractions` merges these two here.

Initially I wondered why we had to worry about this in two places: here and in `_get_unique_fractions`. I think the answers are:

- We worry about them here as an optimization: we avoid computing stats for things that'll just be overridden later.
- We worry about them in `_get_unique_fractions` since we can (in principle) have other places generating these statistics (like a `DataFrameScan`).

So we could maybe cut out `_get_unique_fractions` having to merge these by requiring whoever produces these `ColumnSourceStats` to do the merging, with preference for user-provided stats. Dunno if that's worth it, but I wrote this up to understand things, so I'll submit it for discussion :)
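The merge-with-preference being discussed can be stated compactly. This is a sketch under assumed names, not the actual `_get_unique_fractions` implementation:

```python
def merge_unique_fractions(
    user_fractions: dict[str, float],
    estimated_fractions: dict[str, float],
) -> dict[str, float]:
    # Later entries win in a dict merge, so user-provided values
    # override estimates for any column present in both.
    return {**estimated_fractions, **user_fractions}

merged = merge_unique_fractions({"a": 0.5}, {"a": 0.1, "b": 0.9})
# merged == {"a": 0.5, "b": 0.9}
```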
/ok to test
target_partition_size: int = 0
parquet_metadata_samples: int = 3
The docs say `5` is the default.
    column_statistics,
)

unique_fraction = (
Is there a difference between `{}` and `None` here? I think that `unique_fraction_dict` could be `{}` here if both `column_statistics` and `config_options.executor.unique_fraction` are empty. Then we'll have `bool(unique_fraction_dict) is False` and get the `None`.
    config_options: ConfigOptions,
    column_statistics: MutableMapping[str, ColumnStats],
) -> dict[str, float]:
    assert config_options.executor.name == "streaming", (
One way to avoid this `assert` here is to change the function signature to take `user_unique_fraction: dict[str, float]` and then make the caller responsible for this assertion.

It's possible that this will lead to more of these assertions (e.g. two functions call this, and neither is already making this assertion), but it reduces the "types" of places where we have to make this assertion. I'd have a slight preference to push as many of these assertions to the "edge" as possible.

`_decompose_unique` is a somewhat good example of why pushing this to the edge might be better. It calls into `_get_unique_fractions` with `config_options`. I'm not sure whether `_decompose_unique` is only called with a streaming executor, though, so we'd have to keep walking all the way up to see. I'm not 100% sure, but it does look like we could hit here with the in-memory executor.
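A sketch of the proposed signature change (all names are simplified from the discussion, and `SimpleNamespace` stands in for the real config objects):

```python
from types import SimpleNamespace

def _get_unique_fractions_sketch(
    user_unique_fraction: dict[str, float],
    column_statistics: dict[str, object],
) -> dict[str, float]:
    # No executor assertion here: the caller already validated it.
    return dict(user_unique_fraction)

def caller(config_options, column_statistics):
    # The assertion lives once, at the "edge".
    assert config_options.executor.name == "streaming"
    return _get_unique_fractions_sketch(
        config_options.executor.unique_fraction, column_statistics
    )

config = SimpleNamespace(
    executor=SimpleNamespace(name="streaming", unique_fraction={"a": 0.5})
)
result = caller(config, {})
```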
Description
Supersedes #18865
An important goal of this PR is to lay the foundation for the kind of Join-based statistics gathering prototyped by @wence- in wence-:more-tablestat-doodles. This PR does NOT implement most of the logic in that branch. However, it does implement a compatible foundation for that work.
In wence-:more-tablestat-doodles, statistics are collected in two passes over the original logical plan. The first pass essentially collects statistics originating only from `Scan`/`DataFrameScan` IR nodes. The second pass updates these "base" statistics to account for Join/Filter/GroupBy/etc. This PR only implements the first pass. However, we now collect the "base" statistics (now referred to as "source" statistics) in a format that can also be used for partitioning, and to choose between shuffle- and reduction-based aggregations.