
[FEA] Add a new Parquet reader for high-selectivity table scan #17896

Open
GregoryKimball opened this issue Jan 31, 2025 · 2 comments
Assignees: mhaseeb123
Labels: cuIO (cuIO issue) · feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code) · Spark (Functionality that helps Spark RAPIDS)

GregoryKimball (Contributor) commented Jan 31, 2025

Is your feature request related to a problem? Please describe.
As of 25.04, Spark-RAPIDS performs several additional processing steps as part of the cuDF-based Parquet table scan implementation. First, Spark-RAPIDS parses the file footer and prunes out columns and row groups not needed by the table scan task. Then the plugin uses parquet-mr functions to apply the filter predicate to min/max stats in the footer, further pruning row groups. Next, Spark-RAPIDS completes IO for all the column chunks that remain, optionally decompresses them on the host, and then assembles a new Parquet file in pinned host memory with a new footer. Then cuDF's chunked Parquet reader materializes the data in chunks to remain within the task's memory footprint. Finally, Spark-RAPIDS applies the predicate filter row-wise.

This implementation has several inefficiencies, especially for high-selectivity table scans. First, the file footer is parsed twice in Spark-RAPIDS, written again for cuDF, and then parsed by cuDF. Second, the IO for all column chunks is performed without checking bloom filters, dictionary pages, or the column index to see which data pages could actually pass the filter predicate. This means that excess IO, decompression, and decoding happen on data that ends up discarded during the row-wise predicate filtering.

Describe the solution you'd like
We should add new functions to an experimental namespace, exposing the steps needed to process a high-selectivity Parquet table scan. These steps should minimize the total IO as much as possible, using metadata in the footer, column index, dictionary pages, and bloom filters to avoid IO at the data page level. The steps should also be stateless, to give Spark-RAPIDS better control over spilling, chunking, and retries.

1. Parse metadata from footer bytes.

Function get_valid_row_groups (needs GPU for stats-based pruning; see the signature sketch after this step):

  • Inputs
    • Full footer bytes, including column index pages near file footer, stored in a pinned host buffer
    • Parquet reader options
  • Outputs
    • Parquet metadata, including column index data if present in footer
    • Per-rowgroup validity BOOL8 columns, pruned with row group stats and requested row groups in parquet reader options

Do we need to prune columns from the parquet metadata? Or would that break the filter column indexing? I guess we could prune and reorder the columns and update the filter predicate…
This reader should only support files with column index data that is near the file footer and not support the older spec with column index data in the page headers.
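
As a concrete illustration of this step, here is a minimal C++ signature sketch. The `cudf::io::parquet::experimental` namespace, the `parquet_scan_metadata` struct, and all parameter names are assumptions for illustration, not a final API; the per-row-group validity is shown as a single column for simplicity.

```cpp
// Hypothetical sketch only; names and signature are placeholders.
#include <cudf/column/column.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cstdint>
#include <memory>
#include <utility>

namespace cudf::io::parquet::experimental {

// Decoded footer plus column index data, when present (hypothetical struct).
struct parquet_scan_metadata;

// Parses the footer bytes and returns the metadata together with a BOOL8
// validity column marking which row groups survive stats-based pruning and
// the row groups requested in the reader options.
std::pair<parquet_scan_metadata, std::unique_ptr<cudf::column>>
get_valid_row_groups(cudf::host_span<uint8_t const> footer_bytes,
                     parquet_reader_options const& options,
                     rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```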

2. Get byte ranges of secondary filters: dictionary pages and bloom filters

Function get_secondary_filters (no GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-rowgroup validity BOOL8 columns
  • Outputs
    • Byte ranges for dictionary pages (only filter columns and unpruned row groups)
    • Byte ranges for bloom filters (only filter columns and unpruned row groups)
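
A matching host-only sketch for this step, reusing the hypothetical namespace and `parquet_scan_metadata` struct from the step 1 sketch; the `byte_range` struct is likewise a placeholder.

```cpp
// Continuing the hypothetical sketch (same includes and namespace as step 1).
#include <cudf/column/column_view.hpp>

#include <cstddef>
#include <vector>

namespace cudf::io::parquet::experimental {

// A contiguous span of the parquet file the caller should read.
struct byte_range {
  std::size_t offset;  // byte offset within the file
  std::size_t size;    // length in bytes
};

struct secondary_filter_ranges {
  std::vector<byte_range> dictionary_pages;  // filter columns, unpruned row groups only
  std::vector<byte_range> bloom_filters;     // filter columns, unpruned row groups only
};

// Host-only: walks the metadata and emits the byte ranges for the secondary
// filters; the caller performs the IO and passes the bytes to step 3.
secondary_filter_ranges get_secondary_filters(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_group_validity);

}  // namespace cudf::io::parquet::experimental
```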

3. Prune valid row groups using secondary filters

Function prune_row_groups_by_dictionary_pages (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-rowgroup validity BOOL8 columns
    • Compressed dictionary pages in pinned host buffer (libcudf chooses host or device decompression)
  • Outputs
    • Updated per-rowgroup validity BOOL8 columns

Function prune_row_groups_by_bloom_filters (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-rowgroup validity BOOL8 column
    • Bloom filters in pinned host buffer
  • Outputs
    • Updated per-rowgroup validity BOOL8 column

Function update_validity_with_column_index (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-rowgroup validity BOOL8 column
  • Outputs
    • Updated per-page validity BOOL8 columns
    • Updated per-row validity BOOL8 column
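
Signature sketches for the three step 3 functions, again using the hypothetical names from the earlier sketches. Whether validity is updated in place or returned as new columns is a design choice this proposal leaves open; the sketch returns new columns.

```cpp
// Continuing the hypothetical sketch (same includes and namespace as above).
namespace cudf::io::parquet::experimental {

// Prunes row groups whose dictionary pages show the predicate cannot match.
std::unique_ptr<cudf::column> prune_row_groups_by_dictionary_pages(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_group_validity,
  cudf::host_span<uint8_t const> dictionary_page_data,  // compressed, pinned host buffer
  rmm::cuda_stream_view stream);

// Prunes row groups whose bloom filters rule out the predicate's literals.
std::unique_ptr<cudf::column> prune_row_groups_by_bloom_filters(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_group_validity,
  cudf::host_span<uint8_t const> bloom_filter_data,  // pinned host buffer
  rmm::cuda_stream_view stream);

// Expands row-group validity into per-page and per-row validity using the
// column index data held in the metadata.
struct expanded_validity {
  std::unique_ptr<cudf::column> per_page;  // BOOL8, one element per data page
  std::unique_ptr<cudf::column> per_row;   // BOOL8, one element per row
};

expanded_validity update_validity_with_column_index(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_group_validity,
  rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```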

4. Request filter column data pages

Function get_filter_columns_data_pages (no GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-page validity BOOL8 columns
    • Per-row validity BOOL8 column
  • Outputs
    • Byte ranges for filter column data pages (only unpruned pages from filter columns OR ranges for all column data pages; empty ranges for pruned columns)

Function materialize_filter_columns (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-page validity BOOL8 columns
    • Per-row validity BOOL8 column
    • Compressed data pages in pinned host buffer (libcudf chooses host or device decompression)
  • Outputs
    • Table containing Filter columns and null elsewhere

Function apply_filter_predicate (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-row validity BOOL8 column
    • Table containing Filter columns
  • Outputs
    • Per-row validity BOOL8 columns
    • Table containing Filter columns
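
Sketches for step 4, with the same caveats: `materialize_filter_columns` decompresses and decodes only the surviving filter column pages, and `apply_filter_predicate` then evaluates the predicate row-wise to refine the per-row validity.

```cpp
// Continuing the hypothetical sketch (same includes and namespace as above).
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>

namespace cudf::io::parquet::experimental {

// Host-only: byte ranges for the filter column data pages that survived
// pruning (with empty ranges standing in for pruned pages).
std::vector<byte_range> get_filter_columns_data_pages(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& page_validity,
  cudf::column_view const& row_validity);

// Decodes the filter columns; non-filter columns come back as all-null.
std::unique_ptr<cudf::table> materialize_filter_columns(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& page_validity,
  cudf::column_view const& row_validity,
  cudf::host_span<uint8_t const> filter_page_data,  // compressed, pinned host buffer
  rmm::cuda_stream_view stream);

// Evaluates the predicate row-wise; returns the refined per-row validity
// along with the filter column table.
std::pair<std::unique_ptr<cudf::column>, std::unique_ptr<cudf::table>>
apply_filter_predicate(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_validity,
  cudf::table_view const& filter_columns,
  rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```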

5. Request payload column data pages

Function get_payload_columns_data_pages (no GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-row validity BOOL8 columns
  • Outputs
    • Byte ranges for payload column data pages (only pages with at least one valid row OR byte ranges for all pages; empty for pages with no valid row)

Function materialize_payload_columns (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-row validity BOOL8 columns
    • Compressed data pages in pinned host buffer
  • Outputs
    • Table containing Payload columns and null elsewhere
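
And sketches for step 5, completing the flow. Because every call takes the metadata and validity state as explicit arguments, the caller can spill, chunk, or retry between any two steps, which is the statelessness requirement stated above.

```cpp
// Continuing the hypothetical sketch (same includes and namespace as above).
namespace cudf::io::parquet::experimental {

// Host-only: byte ranges for payload column pages containing at least one
// valid row (or ranges for all pages, with empty ranges for pruned pages).
std::vector<byte_range> get_payload_columns_data_pages(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_validity);

// Decodes the payload columns; filter columns come back as all-null.
std::unique_ptr<cudf::table> materialize_payload_columns(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_validity,
  cudf::host_span<uint8_t const> payload_page_data,  // compressed, pinned host buffer
  rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```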

Additional context
Also see #17716 about refactoring the footer management in the "bulk reader" table scan (the default in Spark-RAPIDS as of 25.04).

GregoryKimball moved this to Story Issue in libcudf on Jan 31, 2025
mhaseeb123 self-assigned this on Feb 1, 2025
mhaseeb123 (Member) commented Feb 1, 2025

The Parquet metadata here would be a struct similar to reader::impl::impl with some extra stuff to allow a stateless reader.
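
For illustration, such a struct might look like the following; its contents are deliberately left as comments, since the real fields would mirror what the current reader caches internally and none of this is actual libcudf API.

```cpp
// Purely illustrative; the concrete contents are assumptions.
struct parquet_scan_metadata {
  // Thrift-decoded file footer: schema, row groups, column chunk metadata.
  // Column index and offset index entries, when present near the footer.
  // Plus whatever per-file state the current reader caches internally,
  // supplied explicitly here so that every call can stay stateless.
};
```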

winningsix (Contributor) commented

It's great to have fine-grained APIs supporting a two-phase Parquet read.

I have two questions around dictionary encoding:

  • For dictionary-encoded columns, do we have an option to control whether they are materialized or not? Late materialization could be helpful, especially for the long-string case.
  • Will we consider supporting a dictionary filter in the first filter pass?

rapids-bot pushed a commit that referenced this issue Feb 13, 2025
…l PQ reader (#17946)

Related to #17896

This PR refactors Parquet reader's predicate pushdown to separate out row group pruning with stats, reading bloom filters, and row group pruning with bloom filters. This allows reusing corresponding functionalities in the experimental PQ reader for highly selective queries (Hybrid scan) as needed.

Note that no code has been added or removed in this PR; it has only been moved around.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #17946