[FEA] Add a new Parquet reader for high-selectivity table scan #17896
Labels: cuIO, feature request, libcudf, Spark
Is your feature request related to a problem? Please describe.
As of 25.04, Spark-RAPIDS performs several additional processing steps as part of the cuDF-based Parquet table scan implementation. First, Spark-RAPIDS parses the file footer and prunes out columns and row groups not needed by the table scan task. Then the plugin uses `parquet-mr` functions to apply the filter predicate to the min/max stats in the footer, further pruning row groups. Next, Spark-RAPIDS performs the IO for all remaining column chunks, optionally decompresses them on the host, and assembles a new parquet file, with a new footer, in host pinned memory. cuDF's chunked parquet reader then materializes the data in chunks to stay within the task's memory footprint. Finally, Spark-RAPIDS applies the predicate filter row-wise.

This implementation has several inefficiencies, especially for high-selectivity table scans. First, the file footer is parsed twice in Spark-RAPIDS, written again for cuDF, and parsed once more by cuDF. Second, the IO for all column chunks is performed without consulting bloom filters, dictionary pages, or the column index to determine which data pages could actually pass the filter predicate. As a result, IO, decompression, and decoding are wasted on data that is discarded during the row-wise predicate filtering.
Describe the solution you'd like
We should add new functions to an experimental namespace that expose the steps needed to process a high-selectivity parquet table scan. These steps should minimize total IO by using the metadata in the footer, the column index, dictionary pages, and bloom filters to avoid IO at the data page level. The steps should also be stateless, giving Spark-RAPIDS better control over spilling, chunking, and retries.
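To make the intended shape of the API concrete before walking through the steps, here is a purely illustrative call sequence for a caller such as Spark-RAPIDS. Only the function names come from this proposal; the `cudf::io::parquet::experimental` namespace path, the intermediate types, the `fetch` helper, and every signature are assumptions that are sketched in more detail under each step below.

```cpp
// Illustrative only: assumes the hypothetical declarations sketched under each
// step below. `fetch` stands in for whatever IO layer the caller (e.g.
// Spark-RAPIDS) uses to read the requested byte ranges from the file.
namespace pq = cudf::io::parquet::experimental;  // assumed namespace

cudf::io::table_with_metadata high_selectivity_scan(
  cudf::host_span<uint8_t const> footer_bytes,
  cudf::ast::expression const& filter,
  rmm::cuda_stream_view stream)
{
  // 1. Parse the footer once and prune row groups with footer min/max stats.
  auto footer = pq::get_valid_row_groups(footer_bytes, filter, stream);

  // 2. Host-only: byte ranges of dictionary pages and bloom filters to fetch.
  auto ranges = pq::get_secondary_filters(*footer.metadata, footer.valid_row_groups);
  auto dict_bytes  = fetch(ranges.dictionary_pages);
  auto bloom_bytes = fetch(ranges.bloom_filters);

  // 3. Prune row groups with the secondary filters, then refine to page level.
  auto row_groups = pq::prune_row_groups_by_dictionary_pages(
    *footer.metadata, footer.valid_row_groups, dict_bytes, filter, stream);
  row_groups = pq::prune_row_groups_by_bloom_filters(
    *footer.metadata, row_groups, bloom_bytes, filter, stream);
  auto pages = pq::update_validity_with_column_index(
    *footer.metadata, row_groups, filter, stream);

  // 4. Fetch and decode only the filter columns, then evaluate the predicate.
  auto filter_bytes = fetch(pq::get_filter_columns_data_pages(*footer.metadata, pages));
  auto filter_cols  = pq::materialize_filter_columns(
    *footer.metadata, pages, filter_bytes, stream);
  auto row_mask = pq::apply_filter_predicate(filter_cols->view(), filter, stream);

  // 5. Fetch and decode only payload pages that can still contain passing rows,
  //    then apply the row mask (assumed behavior) to produce the final table.
  auto payload_bytes = fetch(pq::get_payload_columns_data_pages(*footer.metadata, pages));
  return pq::materialize_payload_columns(
    *footer.metadata, payload_bytes, row_mask->view(), stream);
}
```

Because each call is stateless, the caller owns every intermediate result and can spill, chunk, or retry between steps.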
1. Parse metadata from footer bytes.
Function `get_valid_row_groups` (needs GPU for stats-based pruning). Do we need to prune columns from the parquet metadata? Or would that break the filter column indexing? I guess we could prune and reorder the columns and update the filter predicate…
This reader should only support files whose column index data is located near the file footer, and should not support the older spec that places column index data in the page headers.
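A header-style sketch of what this step might look like. Only the name `get_valid_row_groups` comes from the proposal; the namespace, the opaque `parsed_footer` type, the `footer_scan_info` struct, and the parameter list are assumptions:

```cpp
#include <cudf/ast/expressions.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cstdint>
#include <memory>
#include <vector>

namespace cudf::io::parquet::experimental {  // assumed namespace

// Opaque, host-side handle to the parsed footer metadata (hypothetical type).
class parsed_footer;

// Hypothetical result: the parsed footer plus the row groups that survive
// pruning against the footer's min/max statistics.
struct footer_scan_info {
  std::unique_ptr<parsed_footer> metadata;
  std::vector<cudf::size_type> valid_row_groups;
};

// Parses the footer bytes exactly once and prunes row groups by evaluating the
// filter predicate against footer stats (GPU; memory-resource parameter elided).
footer_scan_info get_valid_row_groups(cudf::host_span<uint8_t const> footer_bytes,
                                      cudf::ast::expression const& filter,
                                      rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```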
2. Get byte ranges of secondary filters: dictionary pages and bloom filters
Function `get_secondary_filters` (no GPU).
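A possible shape for this host-only step, reusing the hypothetical `parsed_footer` type from the step 1 sketch; the `byte_range` and `secondary_filter_ranges` types are likewise assumptions:

```cpp
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>

#include <cstdint>
#include <vector>

namespace cudf::io::parquet::experimental {  // assumed namespace

// A contiguous region of the source file the caller should fetch (hypothetical).
struct byte_range {
  uint64_t offset;
  uint64_t size;
};

// Byte ranges of the secondary filters for the surviving row groups.
struct secondary_filter_ranges {
  std::vector<byte_range> dictionary_pages;  // per (row group, filter column)
  std::vector<byte_range> bloom_filters;     // empty where no bloom filter exists
};

// Host-only: walks the parsed footer, performs no IO and no GPU work.
secondary_filter_ranges get_secondary_filters(
  parsed_footer const& footer,
  cudf::host_span<cudf::size_type const> valid_row_groups);

}  // namespace cudf::io::parquet::experimental
```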
3. Prune valid row groups using secondary filters
Function `prune_row_groups_by_dictionary_pages` (needs GPU).
Function `prune_row_groups_by_bloom_filters` (needs GPU).
Function `update_validity_with_column_index` (needs GPU).
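Possible signatures for the three pruning passes, reusing the hypothetical types from the earlier sketches; the `page_validity` struct and all parameter lists are assumptions:

```cpp
#include <cudf/ast/expressions.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cstdint>
#include <vector>

namespace cudf::io::parquet::experimental {  // assumed namespace

// Hypothetical page-level result: which data pages can still hold passing rows.
struct page_validity {
  std::vector<cudf::size_type> row_groups;     // surviving row groups
  std::vector<std::vector<bool>> valid_pages;  // per filter column, per data page
};

// Drops row groups whose dictionary pages prove no value can pass the filter.
std::vector<cudf::size_type> prune_row_groups_by_dictionary_pages(
  parsed_footer const& footer,
  cudf::host_span<cudf::size_type const> row_groups,
  cudf::host_span<cudf::host_span<uint8_t const> const> dictionary_page_data,
  cudf::ast::expression const& filter,
  rmm::cuda_stream_view stream);

// Drops row groups whose bloom filters rule out every literal in the predicate.
std::vector<cudf::size_type> prune_row_groups_by_bloom_filters(
  parsed_footer const& footer,
  cudf::host_span<cudf::size_type const> row_groups,
  cudf::host_span<cudf::host_span<uint8_t const> const> bloom_filter_data,
  cudf::ast::expression const& filter,
  rmm::cuda_stream_view stream);

// Refines the surviving row groups to a page-level mask using the column index
// (per-page min/max), so later steps can skip individual data pages.
page_validity update_validity_with_column_index(
  parsed_footer const& footer,
  cudf::host_span<cudf::size_type const> row_groups,
  cudf::ast::expression const& filter,
  rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```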
4. Request filter column data pages
Function `get_filter_columns_data_pages` (no GPU).
Function `materialize_filter_columns` (needs GPU).
Function `apply_filter_predicate` (needs GPU).
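Possible signatures for this step, again reusing the hypothetical types above; the parameter lists and return types are assumptions:

```cpp
#include <cudf/ast/expressions.hpp>
#include <cudf/column/column.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cstdint>
#include <memory>
#include <vector>

namespace cudf::io::parquet::experimental {  // assumed namespace

// Host-only: byte ranges of the filter-column data pages still worth fetching.
std::vector<byte_range> get_filter_columns_data_pages(parsed_footer const& footer,
                                                      page_validity const& pages);

// Decompresses and decodes the fetched filter-column pages into a cudf table.
std::unique_ptr<cudf::table> materialize_filter_columns(
  parsed_footer const& footer,
  page_validity const& pages,
  cudf::host_span<cudf::host_span<uint8_t const> const> page_data,
  rmm::cuda_stream_view stream);

// Evaluates the predicate row-wise and returns a BOOL8 row mask.
std::unique_ptr<cudf::column> apply_filter_predicate(
  cudf::table_view const& filter_columns,
  cudf::ast::expression const& filter,
  rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```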
(needs GPU):5. Request payload column data pages
Function `get_payload_columns_data_pages` (no GPU).
Function `materialize_payload_columns` (needs GPU).
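And a possible shape for the final step; as above, everything except the function names is an assumption, including whether the row mask is applied here or by the caller:

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/io/types.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cstdint>
#include <vector>

namespace cudf::io::parquet::experimental {  // assumed namespace

// Host-only: byte ranges of payload-column data pages that overlap rows which
// can still pass the filter.
std::vector<byte_range> get_payload_columns_data_pages(parsed_footer const& footer,
                                                       page_validity const& pages);

// Decodes the fetched payload pages, applies the row mask, and returns the
// filtered table (or one chunk of it, if the caller is reading in chunks).
cudf::io::table_with_metadata materialize_payload_columns(
  parsed_footer const& footer,
  cudf::host_span<cudf::host_span<uint8_t const> const> page_data,
  cudf::column_view const& row_mask,
  rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```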
Additional context
Also see issues about refactoring the footer management in the "bulk reader" table scan (default in Spark-RAPIDS as of 25.04). #17716