Refactor predicate pushdown to reuse row group pruning in experimental PQ reader #17946
@@ -163,108 +162,6 @@ struct bloom_filter_caster {
}
};

/**
Now declared in reader_impl_helpers.hpp and defined at the bottom of this file.
@@ -502,6 +399,17 @@ void read_bloom_filter_data(host_span<std::unique_ptr<datasource> const> sources

} // namespace

size_t aggregate_reader_metadata::get_bloom_filter_alignment() const
Can't use cuco in a .cpp file, so this is separated out here.
std::reference_wrapper<ast::expression const> filter,
rmm::cuda_stream_view stream) const
{
// Number of input table columns
auto const num_input_columns = static_cast<cudf::size_type>(output_dtypes.size());

// Collect equality literals for each input table column
Only do the filtering step here; the bloom filter buffers are read in predicate_pushdown.cpp and passed in.
std::pair<std::optional<std::vector<std::vector<size_type>>>, surviving_row_group_metrics>
aggregate_reader_metadata::filter_row_groups(
host_span<std::unique_ptr<datasource> const> sources,
std::optional<std::vector<std::vector<size_type>>> aggregate_reader_metadata::apply_stats_filters(
Separate out filtering with stats, similar to the bloom filter API.
@@ -446,37 +465,75 @@ aggregate_reader_metadata::filter_row_groups(

// Span of row groups to apply bloom filtering on.
auto const bloom_filter_input_row_groups =
filtered_row_group_indices.has_value()
? host_span<std::vector<size_type> const>(filtered_row_group_indices.value())
stats_filtered_row_groups.has_value()
Set up the bloom filter buffers here now.
@@ -237,6 +242,49 @@ class aggregate_reader_metadata {
host_span<std::vector<size_type> const> row_group_indices,
host_span<int const> column_schemas) const;

/**
These don't need to be public, so they are moved into the private section.
@@ -513,6 +538,54 @@ class named_to_reference_converter : public ast::detail::expression_transformer
std::list<ast::operation> _operators;
};

/**
Moved out from bloom_filter_reader.cu so it can be used in predicate_pushdown.cpp as well.
size_type _num_input_columns;

private:
std::vector<std::vector<ast::literal*>> _literals;
Renamed to just _literals and moved to protected so it can be reused in a child class for pruning row groups with dictionary pages.
*
* @return Vectors of equality literals, one per input table column
*/
[[nodiscard]] std::vector<std::vector<ast::literal*>> get_literals() &&;
Renamed this function from get_equality_literals() to a more generic get_literals().
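To illustrate the two review comments above (the `_literals` member moved to protected, and the `&&`-qualified `get_literals()`), here is a minimal standalone sketch. The class names `literals_collector` and `dictionary_literals_collector`, the `add()` and `num_literals()` helpers, and the use of `std::string` in place of `ast::literal*` are all hypothetical stand-ins and not the cudf types; only the protected member and the rvalue-qualified getter mirror the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical collector; it only mirrors the shape of the class in the diff.
class literals_collector {
 public:
  explicit literals_collector(std::size_t num_input_columns) : _literals(num_input_columns) {}

  // Record a literal for one input column (std::string stands in for ast::literal*).
  void add(std::size_t column_index, std::string literal)
  {
    assert(column_index < _literals.size());
    _literals[column_index].push_back(std::move(literal));
  }

  // The && qualifier restricts this call to rvalues, so the collected
  // vectors can be moved out instead of copied.
  [[nodiscard]] std::vector<std::vector<std::string>> get_literals() &&
  {
    return std::move(_literals);
  }

 protected:
  // Protected rather than private, so a derived collector can reuse it.
  std::vector<std::vector<std::string>> _literals;
};

// Hypothetical child class that reads the protected member directly.
class dictionary_literals_collector : public literals_collector {
 public:
  using literals_collector::literals_collector;

  [[nodiscard]] std::size_t num_literals() const
  {
    std::size_t count = 0;
    for (auto const& column_literals : _literals) { count += column_literals.size(); }
    return count;
  }
};
```

A caller has to write `std::move(collector).get_literals()`; calling `get_literals()` on a named object does not compile, which makes the hand-off of ownership explicit at the call site.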
Looking at the CI errors, it seems like something in benchmarks used to depend on transitively included , and the chain of dependencies has been broken somehow.
/merge
6c281fd into rapidsai:branch-25.04
Description
Related to #17896
This PR refactors the Parquet reader's predicate pushdown into three separate steps: row group pruning with stats, reading bloom filters, and row group pruning with bloom filters. This lets the experimental PQ reader reuse each step as needed for highly selective queries (Hybrid scan).
Note that no code has been added or removed in this PR; it has only been moved around.
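The three-way split described above can be sketched with a toy model. Everything in this block is a hypothetical illustration and not cudf code: `row_group_info`, `toy_bloom_filter` (an exact set, so it never yields false positives), and the free functions only mimic the shape of the refactor for a single equality predicate `column == literal`. Returning `std::nullopt` when nothing was pruned is an assumption, suggested by the `std::optional` return type and the `has_value()` check in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <optional>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for one row group's metadata and data.
struct row_group_info {
  int min_value;                   // column-chunk minimum from statistics
  int max_value;                   // column-chunk maximum from statistics
  std::unordered_set<int> values;  // values a bloom filter would be built from
};

using toy_bloom_filter  = std::unordered_set<int>;  // exact set, for illustration only
using row_group_indices = std::vector<std::size_t>;

// Step 1: prune with min/max stats. Assumption: nullopt means "nothing pruned".
std::optional<row_group_indices> apply_stats_filter(
  std::vector<row_group_info> const& row_groups, row_group_indices const& input, int literal)
{
  row_group_indices kept;
  for (auto idx : input) {
    auto const& rg = row_groups[idx];
    if (literal >= rg.min_value && literal <= rg.max_value) { kept.push_back(idx); }
  }
  if (kept.size() == input.size()) { return std::nullopt; }
  return kept;
}

// Step 2: read bloom filters, but only for the row groups that are still alive.
std::vector<toy_bloom_filter> read_bloom_filters(std::vector<row_group_info> const& row_groups,
                                                 row_group_indices const& input)
{
  std::vector<toy_bloom_filter> filters;
  for (auto idx : input) { filters.push_back(row_groups[idx].values); }
  return filters;
}

// Step 3: prune using filters that were read elsewhere and passed in.
std::optional<row_group_indices> apply_bloom_filters(std::vector<toy_bloom_filter> const& filters,
                                                     row_group_indices const& input,
                                                     int literal)
{
  assert(filters.size() == input.size());
  row_group_indices kept;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (filters[i].count(literal) > 0) { kept.push_back(input[i]); }
  }
  if (kept.size() == input.size()) { return std::nullopt; }
  return kept;
}

// Composition of the three steps; a different reader could call them individually.
row_group_indices filter_row_groups(std::vector<row_group_info> const& row_groups, int literal)
{
  row_group_indices all(row_groups.size());
  std::iota(all.begin(), all.end(), std::size_t{0});

  auto const stats_filtered = apply_stats_filter(row_groups, all, literal);
  // Bloom filtering runs on the stats-pruned row groups when any were pruned.
  row_group_indices const bloom_input = stats_filtered.has_value() ? stats_filtered.value() : all;

  auto const filters        = read_bloom_filters(row_groups, bloom_input);
  auto const bloom_filtered = apply_bloom_filters(filters, bloom_input, literal);
  return bloom_filtered.has_value() ? bloom_filtered.value() : bloom_input;
}
```

Because each step takes its inputs explicitly, a caller that already holds bloom filter buffers can skip `read_bloom_filters` and call `apply_bloom_filters` directly, which is the kind of reuse the description refers to.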
Checklist