[WIP] Track column statistics in cuDF-Polars #19130
Open: rjzamora wants to merge 28 commits into rapidsai:branch-25.08 from rjzamora:column-stats
+683
−96
b9bb173 start with lawrences doodle (rjzamora)
8699a9e save work (rjzamora)
2d590dc revise basic class structure (rjzamora)
6620a6f Merge remote-tracking branch 'upstream/branch-25.08' into column-stats (rjzamora)
511d059 tests passing (rjzamora)
2cf53d2 change the config name (rjzamora)
4505d7d Merge remote-tracking branch 'upstream/branch-25.08' into column-stats (rjzamora)
49ce228 Merge branch 'branch-25.08' into column-stats (rjzamora)
4fee62c remove TableSourceStats (rjzamora)
88247f4 Merge remote-tracking branch 'upstream/branch-25.08' into column-stats (rjzamora)
2d0c43d minor cleanup (rjzamora)
455ad2d Merge branch 'column-stats' of github.com:rjzamora/cudf into column-s… (rjzamora)
e4f284c Merge remote-tracking branch 'upstream/branch-25.08' into column-stats (rjzamora)
63e650a test coverage (rjzamora)
c076ec4 Update python/cudf_polars/cudf_polars/dsl/traversal.py (rjzamora)
6672aa3 use LRU instead of FIFO (rjzamora)
2def2df Merge branch 'column-stats' of github.com:rjzamora/cudf into column-s… (rjzamora)
daedd2d Merge remote-tracking branch 'upstream/branch-25.08' into column-stats (rjzamora)
adce001 avoid key errors (rjzamora)
b7f52e4 Merge remote-tracking branch 'upstream/branch-25.08' into column-stats (rjzamora)
49802a9 Merge remote-tracking branch 'upstream/branch-25.08' into column-stats (rjzamora)
afd1a9b rename column_statistics to column_stats (rjzamora)
c6af3b4 more renaming of statistics to stats (rjzamora)
6b162b3 pull config_options back out of StatsCollector (rjzamora)
6d168eb Merge remote-tracking branch 'upstream/branch-25.08' into column-stats (rjzamora)
2208fad fix typo (rjzamora)
ab981c5 change _get_unique_fractions input types (rjzamora)
7cfd6bf Merge branch 'branch-25.08' into column-stats (rjzamora)
@@ -13,7 +13,11 @@
 from cudf_polars.dsl.ir import Distinct
 from cudf_polars.experimental.base import PartitionInfo
 from cudf_polars.experimental.dispatch import lower_ir_node
-from cudf_polars.experimental.utils import _fallback_inform, _lower_ir_fallback
+from cudf_polars.experimental.utils import (
+    _fallback_inform,
+    _get_unique_fractions,
+    _lower_ir_fallback,
+)

 if TYPE_CHECKING:
     from collections.abc import MutableMapping
@@ -29,7 +33,7 @@ def lower_distinct(
     partition_info: MutableMapping[IR, PartitionInfo],
     config_options: ConfigOptions,
     *,
-    cardinality: float | None = None,
+    unique_fraction: float | None = None,
 ) -> tuple[IR, MutableMapping[IR, PartitionInfo]]:
     """
     Lower a Distinct IR into partition-wise stages.
@@ -46,8 +50,8 @@ def lower_distinct(
         associated partitioning information.
     config_options
         GPUEngine configuration options.
-    cardinality
-        Cardinality factor to use for algorithm selection.
+    unique_fraction
+        Fractional unique count to use for algorithm selection.

     Returns
     -------
@@ -112,14 +116,14 @@ def lower_distinct(
             # partitions. For now, we raise an error to fall back
             # to one partition.
             raise NotImplementedError("Unsupported slice for multiple partitions.")
-    elif cardinality is not None:
-        # Use cardinality to determine partitioningcardinality
-        n_ary = min(max(int(1.0 / cardinality), 2), child_count)
-        output_count = max(int(cardinality * child_count), 1)
+    elif unique_fraction is not None:
+        # Use unique_fraction to determine partitioning
+        n_ary = min(max(int(1.0 / unique_fraction), 2), child_count)
+        output_count = max(int(unique_fraction * child_count), 1)

     if output_count > 1 and require_tree_reduction:
         # Need to reduce down to a single partition even
-        # if the cardinality is large.
+        # if the unique_fraction is large.
         output_count = 1
         _fallback_inform(
             "Unsupported unique options for multiple partitions.",
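The arithmetic in this hunk can be checked in isolation. The sketch below copies the two expressions into a standalone function. The name `plan_distinct_reduction` is hypothetical and not part of cudf-polars, and the sketch assumes `unique_fraction` is already a float in (0, 1] and that `n_ary` is the fan-in of the tree reduction (the hunk does not say so explicitly).

```python
def plan_distinct_reduction(unique_fraction: float, child_count: int) -> tuple[int, int]:
    """Return (n_ary, output_count) using the same expressions as the hunk.

    Hypothetical helper for illustration only; not a cudf-polars API.
    """
    # Fan-in: roughly 1 / unique_fraction, at least 2, at most child_count.
    n_ary = min(max(int(1.0 / unique_fraction), 2), child_count)
    # Output partitions: scale the input count by the unique fraction,
    # but always keep at least one partition.
    output_count = max(int(unique_fraction * child_count), 1)
    return n_ary, output_count


if __name__ == "__main__":
    for fraction in (0.01, 0.25, 0.5, 1.0):
        print(fraction, plan_distinct_reduction(fraction, 64))
```

Under these assumptions, a small fraction (few unique rows) yields a wide fan-in and a single output partition, while a fraction near 1 keeps the fan-in at 2 and leaves most partitions in place.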
@@ -164,24 +168,30 @@ def _(
     # Extract child partitioning
     child, partition_info = rec(ir.children[0])
     config_options = rec.state["config_options"]
+    column_stats = rec.state["stats"].column_stats.get(ir.children[0], {})

     assert config_options.executor.name == "streaming", (
         "'in-memory' executor not supported in 'lower_ir_node'"
     )

-    subset: frozenset = ir.subset or frozenset(ir.schema)
-    cardinality_factor = {
-        c: max(min(f, 1.0), 0.00001)
-        for c, f in config_options.executor.cardinality_factor.items()
-        if c in subset
-    }
-    cardinality = max(cardinality_factor.values()) if cardinality_factor else None
+    subset: frozenset[str] = ir.subset or frozenset(ir.schema)
+    unique_fraction_dict = _get_unique_fractions(
+        tuple(subset),
+        config_options.executor.unique_fraction,
+        column_stats,
+    )
+
+    unique_fraction = (
[Review comment] Is there a difference between
+        max(unique_fraction_dict.values()) if unique_fraction_dict else None
+    )

     try:
         return lower_distinct(
             ir,
             child,
             partition_info,
             config_options,
-            cardinality=cardinality,
+            unique_fraction=unique_fraction,
         )
     except NotImplementedError as err:
         return _lower_ir_fallback(ir, rec, msg=str(err))
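For reference, the selection step in this hunk (restrict to the `subset` columns, then take the largest fraction) can be sketched on plain dictionaries. The helper `select_unique_fraction` below is hypothetical. It reproduces the clamping from the removed lines, and it is only an assumption that `_get_unique_fractions` clamps the same way; how that function folds in `column_stats` is not shown in this diff, so the sketch leaves statistics out entirely.

```python
from __future__ import annotations

from collections.abc import Mapping


def select_unique_fraction(
    subset: frozenset[str],
    fractions: Mapping[str, float],
    *,
    floor: float = 0.00001,
) -> float | None:
    """Pick one unique fraction for a Distinct over ``subset``.

    Hypothetical stand-in for illustration; not the cudf-polars helper.
    """
    # Keep only columns that participate in the Distinct, and clamp each
    # value into [floor, 1.0] as the removed code did.
    clamped = {
        name: max(min(value, 1.0), floor)
        for name, value in fractions.items()
        if name in subset
    }
    # The distinct count of a column combination is at least the distinct
    # count of any single column, so the largest fraction is used.
    return max(clamped.values()) if clamped else None


if __name__ == "__main__":
    print(select_unique_fraction(frozenset({"a", "b"}), {"a": 0.1, "b": 0.5, "c": 0.9}))
    print(select_unique_fraction(frozenset({"a"}), {}))
```

Returning `None` when no fraction is known matches the `unique_fraction: float | None = None` signature of `lower_distinct`, which then skips the `elif unique_fraction is not None` branch.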
Is this source stats specific? I see we also do stuff for `Join`, which IIUC just does stuff with `ColumnStats`. Maybe rename to `collect_stats`?