Refactor predicate pushdown to reuse row group pruning in experimental PQ reader #17946

Merged

Conversation

mhaseeb123
Member

@mhaseeb123 mhaseeb123 commented Feb 7, 2025

Description

Related to #17896

This PR refactors the Parquet reader's predicate pushdown into three separate steps: row group pruning with stats, reading bloom filters, and row group pruning with bloom filters. The experimental PQ reader can then reuse each step as needed for highly selective queries (hybrid scan).

Note that this PR neither adds nor removes code; existing code is only moved around.
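
As a rough illustration of the split described above, here is a minimal sketch of three independent stages composed into one filter. All names (`pruning_stages`, `row_group_list`, `bloom_buffers`, `filter_row_groups` as a free function) are hypothetical stand-ins for this sketch and are not the libcudf types or signatures.

```cpp
#include <functional>
#include <optional>
#include <vector>

// Hypothetical stand-ins; these are not libcudf types.
using row_group_list = std::vector<int>;
using bloom_buffers  = std::vector<std::vector<unsigned char>>;

// Each stage is a separate callable so a caller can run any subset of them.
struct pruning_stages {
  // Stage 1: prune with statistics. nullopt means "nothing was pruned".
  std::function<std::optional<row_group_list>(row_group_list const&)> prune_with_stats;
  // Stage 2: read bloom filter buffers for the surviving row groups.
  std::function<bloom_buffers(row_group_list const&)> read_bloom_filters;
  // Stage 3: prune with the buffers produced by stage 2.
  std::function<std::optional<row_group_list>(row_group_list const&, bloom_buffers const&)>
    prune_with_bloom_filters;
};

// Runs all three stages in order, as a single monolithic filter would.
inline row_group_list filter_row_groups(pruning_stages const& stages,
                                        row_group_list const& input)
{
  auto const stats_filtered = stages.prune_with_stats(input);
  // Feed the stats-filtered row groups to the next stage if any were pruned.
  row_group_list const& survivors = stats_filtered.has_value() ? *stats_filtered : input;
  auto const buffers        = stages.read_bloom_filters(survivors);
  auto const bloom_filtered = stages.prune_with_bloom_filters(survivors, buffers);
  return bloom_filtered.value_or(survivors);
}
```

Because the stages are separate, a different reader could call only the stats stage, or supply its own way of reading the bloom filter buffers, without going through the combined function.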

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Feb 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 7, 2025
@@ -163,108 +162,6 @@ struct bloom_filter_caster {
}
};

/**
Member Author

Now declared in reader_impl_helpers.hpp and defined at the bottom of this file.

@@ -502,6 +399,17 @@ void read_bloom_filter_data(host_span<std::unique_ptr<datasource> const> sources

} // namespace

size_t aggregate_reader_metadata::get_bloom_filter_alignment() const
Member Author

cuco can't be used in a .cpp file, so this is separated out here.

std::reference_wrapper<ast::expression const> filter,
rmm::cuda_stream_view stream) const
{
// Number of input table columns
auto const num_input_columns = static_cast<cudf::size_type>(output_dtypes.size());

// Collect equality literals for each input table column
Member Author

Only the filtering step is done here; the bloom filter buffers are read in predicate_pushdown.cpp and passed in.
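
To illustrate a filtering step that receives already-built filters, here is a toy sketch. `toy_bloom_filter` and `prune_with_bloom_filters` are hypothetical names for this example; the filter uses a simple two-probe bit array and is not the Parquet bloom filter format or the cuco implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy bloom filter for illustration only. num_bits must be greater than zero.
class toy_bloom_filter {
 public:
  explicit toy_bloom_filter(std::size_t num_bits) : _bits(num_bits, false) {}

  void insert(std::int64_t key)
  {
    _bits[slot(key, 0)] = true;
    _bits[slot(key, 1)] = true;
  }

  // False means the key is definitely absent; true means it may be present.
  [[nodiscard]] bool may_contain(std::int64_t key) const
  {
    return _bits[slot(key, 0)] && _bits[slot(key, 1)];
  }

 private:
  // Maps a key and a probe index to a bit position with a simple multiplicative mix.
  [[nodiscard]] std::size_t slot(std::int64_t key, std::uint64_t seed) const
  {
    auto const mixed =
      static_cast<std::uint64_t>(key) * 0x9E3779B97F4A7C15ULL + seed * 0xC2B2AE3D27D4EB4FULL;
    return static_cast<std::size_t>((mixed >> 17) % _bits.size());
  }

  std::vector<bool> _bits;
};

// Filtering only: the filters (one per row group) are built or read by the caller.
// Keeps the row groups that may contain `literal` for a `column == literal` predicate.
inline std::vector<std::size_t> prune_with_bloom_filters(
  std::vector<toy_bloom_filter> const& filters, std::int64_t literal)
{
  std::vector<std::size_t> kept;
  for (std::size_t rg = 0; rg < filters.size(); ++rg) {
    if (filters[rg].may_contain(literal)) { kept.push_back(rg); }
  }
  return kept;
}
```

A bloom filter can report false positives but never false negatives, so a row group dropped by this step cannot contain the literal.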

std::pair<std::optional<std::vector<std::vector<size_type>>>, surviving_row_group_metrics>
aggregate_reader_metadata::filter_row_groups(
host_span<std::unique_ptr<datasource> const> sources,
std::optional<std::vector<std::vector<size_type>>> aggregate_reader_metadata::apply_stats_filters(
Member Author

Separated out filtering with stats, similar to the bloom filter API.
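
For readers unfamiliar with stats-based pruning, a minimal sketch follows. `chunk_stats` and `prune_with_stats` are hypothetical names, the predicate is limited to `column == value` on integers, and returning `std::nullopt` when nothing is pruned is an assumed convention for this sketch rather than a statement about `apply_stats_filters`.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical min/max statistics for one column chunk in one row group.
struct chunk_stats {
  int min;
  int max;
};

// Keeps the row groups whose [min, max] range may contain `value`, i.e. those
// that could satisfy `column == value`. Returns nullopt when no row group is
// pruned (assumed convention for this sketch).
inline std::optional<std::vector<std::size_t>> prune_with_stats(
  std::vector<chunk_stats> const& stats, int value)
{
  std::vector<std::size_t> kept;
  for (std::size_t rg = 0; rg < stats.size(); ++rg) {
    if (stats[rg].min <= value && value <= stats[rg].max) { kept.push_back(rg); }
  }
  if (kept.size() == stats.size()) { return std::nullopt; }
  return kept;
}
```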

@@ -446,37 +465,75 @@ aggregate_reader_metadata::filter_row_groups(

// Span of row groups to apply bloom filtering on.
auto const bloom_filter_input_row_groups =
filtered_row_group_indices.has_value()
? host_span<std::vector<size_type> const>(filtered_row_group_indices.value())
stats_filtered_row_groups.has_value()
Member Author

The bloom filter buffers are now set up here.

@@ -237,6 +242,49 @@ class aggregate_reader_metadata {
host_span<std::vector<size_type> const> row_group_indices,
host_span<int const> column_schemas) const;

/**
Member Author

These don't need to be public, so they were moved into the private section.

@@ -513,6 +538,54 @@ class named_to_reference_converter : public ast::detail::expression_transformer
std::list<ast::operation> _operators;
};

/**
Member Author

@mhaseeb123 mhaseeb123 Feb 7, 2025

Moved out from bloom_filter_reader.cu so it can be used in predicate_pushdown.cpp as well

@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 7, 2025
@mhaseeb123 mhaseeb123 marked this pull request as ready for review February 7, 2025 03:43
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner February 7, 2025 03:43
@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 7, 2025
@mhaseeb123 mhaseeb123 changed the title Refactor predicate pushdown to use with hybrid scan Refactor predicate pushdown to use filtering in experimental PQ reader Feb 7, 2025
@mhaseeb123 mhaseeb123 changed the title Refactor predicate pushdown to use filtering in experimental PQ reader Refactor predicate pushdown to reuse row group pruning in experimental PQ reader Feb 7, 2025
@mhaseeb123 mhaseeb123 requested a review from vuule February 7, 2025 19:41
size_type _num_input_columns;

private:
std::vector<std::vector<ast::literal*>> _literals;
Member Author

Renamed to just _literals and moved to protected so a child class can reuse it for pruning row groups with dictionary pages.

*
* @return Vectors of equality literals, one per input table column
*/
[[nodiscard]] std::vector<std::vector<ast::literal*>> get_literals() &&;
Member Author

Renamed this function from get_equality_literals() to a more generic get_literals()
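
The `&&` qualifier on `get_literals()` means it can only be called on an rvalue, which lets the collected vectors be moved out instead of copied. A minimal sketch of that pattern, using a hypothetical `literal_collector` class with `std::string` in place of `ast::literal*`:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical collector; stands in for a class that gathers literals per column.
class literal_collector {
 public:
  void add(std::size_t column, std::string literal)
  {
    if (column >= _literals.size()) { _literals.resize(column + 1); }
    _literals[column].push_back(std::move(literal));
  }

  // Callable only on rvalues: hands the collected vectors to the caller without a copy.
  [[nodiscard]] std::vector<std::vector<std::string>> get_literals() &&
  {
    return std::move(_literals);
  }

 protected:
  // Protected so that a derived collector can reuse the same storage.
  std::vector<std::vector<std::string>> _literals;
};
```

A caller writes `std::move(collector).get_literals()`, which makes it explicit at the call site that the collector should not be used for its literals afterwards.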

@vuule
Contributor

vuule commented Feb 11, 2025

Looking at the CI errors, it seems like something in benchmarks used to depend on transitively included , and the chain of dependencies has been broken somehow.

@mhaseeb123 mhaseeb123 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Feb 11, 2025
@mhaseeb123
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 6c281fd into rapidsai:branch-25.04 Feb 13, 2025
108 of 109 checks passed
@mhaseeb123 mhaseeb123 deleted the fea/refactor-predicate-pushdown branch February 13, 2025 22:10
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

3 participants