Refactor predicate pushdown to reuse row group pruning in experimental PQ reader #17946

Merged

rapids-bot merged 18 commits into rapidsai:branch-25.04 from mhaseeb123:fea/refactor-predicate-pushdown

Feb 13, 2025

Member

mhaseeb123 commented Feb 7, 2025

Description

Related to #17896

This PR refactors the Parquet reader's predicate pushdown into three separate steps: row group pruning with column statistics, reading bloom filters, and row group pruning with bloom filters. The experimental Parquet (PQ) reader for highly selective queries (hybrid scan) can then reuse each step as needed.

Note that no code has been added or removed in this PR; it has only been moved around.
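As a rough illustration of the three-step split described above, the C++ sketch below chains a statistics-based pruning step with a bloom-filter-based pruning step. It is a simplified model under stated assumptions, not the libcudf code: `row_group_info`, `apply_stats_filter`, `apply_bloom_filter`, and this free-function `filter_row_groups` are hypothetical names, the predicate is a single equality against an integer literal, and the "bloom filter" is modeled as an exact set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for per-row-group metadata; not a libcudf type.
struct row_group_info {
  std::int64_t min_value;                       // min statistic of the column chunk
  std::int64_t max_value;                       // max statistic of the column chunk
  std::unordered_set<std::int64_t> membership;  // exact set standing in for a bloom filter
};

using row_group_list = std::vector<int>;

// Step 1: keep row groups whose [min, max] range may contain `literal`.
// Returns std::nullopt when no row group was pruned.
std::optional<row_group_list> apply_stats_filter(std::vector<row_group_info> const& row_groups,
                                                 row_group_list const& input,
                                                 std::int64_t literal)
{
  row_group_list kept;
  for (int idx : input) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < row_groups.size());
    auto const& rg = row_groups[idx];
    if (literal >= rg.min_value && literal <= rg.max_value) { kept.push_back(idx); }
  }
  if (kept.size() == input.size()) { return std::nullopt; }
  return kept;
}

// Step 2: keep row groups whose (already loaded) filter reports `literal` as possibly present.
// Returns std::nullopt when no row group was pruned.
std::optional<row_group_list> apply_bloom_filter(std::vector<row_group_info> const& row_groups,
                                                 row_group_list const& input,
                                                 std::int64_t literal)
{
  row_group_list kept;
  for (int idx : input) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < row_groups.size());
    if (row_groups[idx].membership.count(literal) > 0) { kept.push_back(idx); }
  }
  if (kept.size() == input.size()) { return std::nullopt; }
  return kept;
}

// Driver: run the stats step, then feed its survivors (or the original input
// if nothing was pruned) into the bloom filter step.
row_group_list filter_row_groups(std::vector<row_group_info> const& row_groups,
                                 row_group_list const& input,
                                 std::int64_t literal)
{
  auto const stats_filtered = apply_stats_filter(row_groups, input, literal);
  auto const& bloom_input   = stats_filtered.has_value() ? stats_filtered.value() : input;
  auto const bloom_filtered = apply_bloom_filter(row_groups, bloom_input, literal);
  return bloom_filtered.has_value() ? bloom_filtered.value() : bloom_input;
}
```

Because each step takes a list of candidate row groups and returns an optional pruned list, a different caller can invoke either step on its own, which is the kind of reuse the description is aiming for.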

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.


          Refactor predicate pushdown to use with hybrid scan

9889ccb

copy-pr-bot bot commented Feb 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions bot assigned mhaseeb123

github-actions bot added the libcudf label

mhaseeb123 commented

cpp/src/io/parquet/bloom_filter_reader.cu

@@ -163,108 +162,6 @@ struct bloom_filter_caster {
                 }
               };
-              /**

Member Author

mhaseeb123 Feb 7, 2025

Now declared in reader_impl_helpers.hpp and defined at the bottom of this file.

mhaseeb123 commented

cpp/src/io/parquet/bloom_filter_reader.cu

		@@ -502,6 +399,17 @@ void read_bloom_filter_data(host_span<std::unique_ptr<datasource> const> sources

		} // namespace

		size_t aggregate_reader_metadata::get_bloom_filter_alignment() const

Member Author

mhaseeb123 Feb 7, 2025

Can't use cuco in a .cpp file, so this is separated out here.
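The pattern in this comment, declaring a member function in a header and defining it in the one source file that can see a device-only library, can be sketched as follows. This is a single-file illustration under assumptions: `metadata_sketch`, `get_filter_alignment`, and `filter_block_sketch` are hypothetical names, and `filter_block_sketch` merely stands in for a type that only the device-compiled file could include. It is not a cuco type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// --- Would live in a header that plain .cpp files can include (hypothetical) ---
class metadata_sketch {
 public:
  // Declaration only: callers need no knowledge of the filter's block type.
  [[nodiscard]] std::size_t get_filter_alignment() const;
};

// --- Would live in the source file built by the device compiler (hypothetical) ---
// Stand-in for a type that comes from a device-only library.
struct alignas(32) filter_block_sketch {
  std::uint32_t words[8];
};

std::size_t metadata_sketch::get_filter_alignment() const
{
  constexpr std::size_t alignment = alignof(filter_block_sketch);
  // Alignments are powers of two; check that assumption in debug builds.
  assert((alignment & (alignment - 1)) == 0);
  return alignment;
}
```

Host-only translation units then depend only on the declaration, so the heavy include stays confined to one file.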

mhaseeb123 commented

cpp/src/io/parquet/bloom_filter_reader.cu

                 std::reference_wrapper<ast::expression const> filter,
                 rmm::cuda_stream_view stream) const
               {
                 // Number of input table columns
                 auto const num_input_columns = static_cast<cudf::size_type>(output_dtypes.size());
-                // Collect equality literals for each input table column

Member Author

mhaseeb123 Feb 7, 2025

Only do the filtering step here; bloom filter buffers are read in predicate_pushdown.cpp and passed in.

mhaseeb123 commented

cpp/src/io/parquet/predicate_pushdown.cpp

-              std::pair<std::optional<std::vector<std::vector<size_type>>>, surviving_row_group_metrics>
-              aggregate_reader_metadata::filter_row_groups(
-                host_span<std::unique_ptr<datasource> const> sources,
+              std::optional<std::vector<std::vector<size_type>>> aggregate_reader_metadata::apply_stats_filters(

Member Author

mhaseeb123 Feb 7, 2025

Separate out filtering with stats, similar to the bloom filter API.

mhaseeb123 commented

cpp/src/io/parquet/predicate_pushdown.cpp

@@ -446,37 +465,75 @@ aggregate_reader_metadata::filter_row_groups(
                 // Span of row groups to apply bloom filtering on.
                 auto const bloom_filter_input_row_groups =
-                  filtered_row_group_indices.has_value()
-                    ? host_span<std::vector<size_type> const>(filtered_row_group_indices.value())
+                  stats_filtered_row_groups.has_value()

Member Author

mhaseeb123 Feb 7, 2025

Set up bloom filter buffers here now.

mhaseeb123 commented

cpp/src/io/parquet/reader_impl_helpers.hpp

@@ -237,6 +242,49 @@ class aggregate_reader_metadata {
                   host_span<std::vector<size_type> const> row_group_indices,
                   host_span<int const> column_schemas) const;
+                /**

Member Author

mhaseeb123 Feb 7, 2025

These don't need to be public, so they are moved into the private section.

mhaseeb123 commented

cpp/src/io/parquet/reader_impl_helpers.hpp

                 std::list<ast::operation> _operators;
               };
+              /**

Member Author

mhaseeb123 Feb 7, 2025

Moved out from bloom_filter_reader.cu so it can be used in predicate_pushdown.cpp as well


          Update docstrings

e5cb699

mhaseeb123 added 2 - In Progress cuIO improvement non-breaking labels

mhaseeb123 marked this pull request as ready for review

February 7, 2025 03:43

mhaseeb123 requested a review from a team as a code owner

February 7, 2025 03:43

mhaseeb123 requested review from devavret and nvdbaranec

February 7, 2025 03:43

mhaseeb123 added 3 - Ready for Review and removed 2 - In Progress labels

mhaseeb123 changed the title ~~Refactor predicate pushdown to use with hybrid scan~~ Refactor predicate pushdown to use filtering in experimental PQ reader

mhaseeb123 changed the title ~~Refactor predicate pushdown to use filtering in experimental PQ reader~~ Refactor predicate pushdown to reuse row group pruning in experimental PQ reader


          Merge branch 'branch-25.04' into fea/refactor-predicate-pushdown

d8f7c9e

mhaseeb123 requested a review from vuule

February 7, 2025 19:41

karthikeyann reviewed

cpp/src/io/parquet/reader_impl_helpers.hpp (outdated, resolved)

mhaseeb123 commented

cpp/src/io/parquet/reader_impl_helpers.hpp (outdated, resolved)


          Update cpp/src/io/parquet/reader_impl_helpers.hpp

d322a7e

mhaseeb123 requested a review from karthikeyann

February 8, 2025 01:00

mhaseeb123 added 4 commits

February 10, 2025 22:55


          Rename equality_literals to simply literals for reuse elsewhere

564f8e1


          Make literals protected instead of private.

3fd14d0


          Add the missed replacements

20528ba


          Revert literals to private for now

e842d86

mhaseeb123 commented

cpp/src/io/parquet/reader_impl_helpers.hpp

+                size_type _num_input_columns;
+               private:
+                std::vector<std::vector<ast::literal*>> _literals;

Member Author

mhaseeb123 Feb 11, 2025

Renamed to just _literals and moved to protected so it can be reused in a child class for pruning row groups with dictionary pages.

karthikeyann approved these changes

mhaseeb123 commented

cpp/src/io/parquet/reader_impl_helpers.hpp

+                 *
+                 * @return Vectors of equality literals, one per input table column
+                 */
+                [[nodiscard]] std::vector<std::vector<ast::literal*>> get_literals() &&;

Member Author

mhaseeb123 Feb 11, 2025

Renamed this function from get_equality_literals() to a more generic get_literals()
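The trailing `&&` on `get_literals()` in the diff above is an rvalue ref-qualifier: the function can only be called on an expiring object, which lets it hand over its internal vector by move rather than by copy. A minimal sketch of that idiom, assuming a hypothetical `literal_collector_sketch` class that stores strings instead of `ast::literal*`:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical collector; stores strings where the real code stores ast::literal*.
class literal_collector_sketch {
 public:
  explicit literal_collector_sketch(std::size_t num_columns) : _literals(num_columns) {}

  // Record a literal seen in a predicate on the given input column.
  void add(std::size_t column, std::string literal)
  {
    assert(column < _literals.size());
    _literals[column].push_back(std::move(literal));
  }

  // Rvalue-qualified: callable only on temporaries or std::move(obj),
  // so moving the member out cannot surprise a caller who still needs it.
  [[nodiscard]] std::vector<std::vector<std::string>> get_literals() &&
  {
    return std::move(_literals);
  }

 private:
  std::vector<std::vector<std::string>> _literals;
};
```

A call such as `std::move(collector).get_literals()` compiles, while `collector.get_literals()` on a named object does not, which makes the ownership transfer explicit at the call site.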

vuule approved these changes


          Merge branch 'branch-25.04' into fea/refactor-predicate-pushdown

f82fcb5

github-actions bot assigned vuule

Contributor

vuule commented Feb 11, 2025

Looking at the CI errors, it seems like something in benchmarks used to depend on a transitively included header, and the chain of dependencies has been broken somehow.

mhaseeb123 added 5 - Ready to Merge and removed 3 - Ready for Review labels


          include missing header

694cf19

mhaseeb123 commented

cpp/benchmarks/hashing/partition.cpp (outdated, resolved)

mhaseeb123 and others added 8 commits

February 11, 2025 23:59


          Fix copyrights year

cbe6669


          Add missing header

c5bc69f


          Merge branch 'branch-25.04' into fea/refactor-predicate-pushdown

34420c2


          Add missing @param in docstring

38948d6


          Rename equality_literals to simply literals


          Merge branch 'branch-25.04' into fea/refactor-predicate-pushdown

97d787d


          Merge branch 'branch-25.04' into fea/refactor-predicate-pushdown

978021d


          Merge branch 'branch-25.04' into fea/refactor-predicate-pushdown

2ebd9bb

Member Author

mhaseeb123 commented Feb 13, 2025

/merge

rapids-bot bot merged commit 6c281fd into rapidsai:branch-25.04

108 of 109 checks passed

mhaseeb123 deleted the fea/refactor-predicate-pushdown branch

February 13, 2025 22:10

Reviewers

karthikeyann approved these changes

vuule approved these changes

devavret: awaiting requested review (code owner automatically assigned from rapidsai/cudf-cpp-codeowners)

nvdbaranec: awaiting requested review (code owner automatically assigned from rapidsai/cudf-cpp-codeowners)

Labels

5 - Ready to Merge, cuIO, improvement, libcudf, non-breaking