
[FEA] Add a new Parquet reader for high-selectivity table scan #17896

Open
GregoryKimball opened this issue Jan 31, 2025 · 2 comments
Assignees: mhaseeb123
Labels: cuIO (cuIO issue) · feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code) · Spark (Functionality that helps Spark RAPIDS)

GregoryKimball (Contributor) commented Jan 31, 2025

Is your feature request related to a problem? Please describe.
As of 25.04, Spark-RAPIDS performs several additional processing steps as part of the cuDF-based Parquet table scan implementation. First, Spark-RAPIDS parses the file footer and prunes out columns and row groups not needed by the table scan task. Then the plugin uses parquet-mr functions to apply the filter predicate to min/max stats in the footer, further pruning row groups. Next, Spark-RAPIDS completes IO for all the column chunks that remain, optionally decompresses them on the host, and then assembles a new Parquet file in pinned host memory with a new footer. Then cuDF's chunked Parquet reader materializes the data in chunks to remain within the task's memory footprint. Finally, Spark-RAPIDS applies the predicate filter row-wise.

This implementation has several inefficiencies, especially for high-selectivity table scans. First, the file footer is parsed twice in Spark-RAPIDS, written again for cuDF, and then parsed by cuDF. Second, the IO for all column chunks is performed without checking bloom filters, dictionary pages, or the column index to see which data pages could actually pass the filter predicate. This means that excess IO, decompression, and decoding happen on data that ends up discarded during the row-wise predicate filtering.

Describe the solution you'd like
We should add new functions to an experimental namespace, exposing the steps needed to process a high-selectivity Parquet table scan. These steps should minimize the total IO as much as possible, using metadata in the footer, column index, dictionary pages, and bloom filters to avoid IO at the data page level. The steps should also be stateless, to give Spark-RAPIDS better control over spilling, chunking, and retries.

1. Parse metadata from footer bytes.

Function get_valid_row_groups (needs GPU for stats-based pruning; see the signature sketch after this step):

  • Inputs
    • Full footer bytes, including column index pages near file footer, stored in a pinned host buffer
    • Parquet reader options
  • Outputs
    • Parquet metadata, including column index data if present in footer
    • Per-rowgroup validity BOOL8 columns, pruned with row group stats and requested row groups in parquet reader options

Do we need to prune columns from the parquet metadata? Or would that break the filter column indexing? I guess we could prune and reorder the columns and update the filter predicate…
This reader should only support files with column index data that is near the file footer and not support the older spec with column index data in the page headers.
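
As a concrete illustration of this step, here is a minimal C++ signature sketch. The `cudf::io::parquet::experimental` namespace, the `parquet_scan_metadata` struct, and all parameter names are assumptions for illustration, not a final API; the per-row-group validity is shown as a single column for simplicity.

```cpp
// Hypothetical sketch only; names and signature are placeholders.
#include <cudf/column/column.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cstdint>
#include <memory>
#include <utility>

namespace cudf::io::parquet::experimental {

// Decoded footer plus column index data, when present (hypothetical struct).
struct parquet_scan_metadata;

// Parses the footer bytes and returns the metadata together with a BOOL8
// validity column marking which row groups survive stats-based pruning and
// the row groups requested in the reader options.
std::pair<parquet_scan_metadata, std::unique_ptr<cudf::column>>
get_valid_row_groups(cudf::host_span<uint8_t const> footer_bytes,
                     parquet_reader_options const& options,
                     rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```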

2. Get byte ranges of secondary filters: dictionary pages and bloom filters

Function get_secondary_filters (no GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-rowgroup validity BOOL8 columns
  • Outputs
    • Byte ranges for dictionary pages (only filter columns and unpruned row groups)
    • Byte ranges for bloom filters (only filter columns and unpruned row groups)
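
A matching host-only sketch for this step, reusing the hypothetical namespace and `parquet_scan_metadata` struct from the step 1 sketch; the `byte_range` struct is likewise a placeholder.

```cpp
// Continuing the hypothetical sketch (same includes and namespace as step 1).
#include <cudf/column/column_view.hpp>

#include <cstddef>
#include <vector>

namespace cudf::io::parquet::experimental {

// A contiguous span of the parquet file the caller should read.
struct byte_range {
  std::size_t offset;  // byte offset within the file
  std::size_t size;    // length in bytes
};

struct secondary_filter_ranges {
  std::vector<byte_range> dictionary_pages;  // filter columns, unpruned row groups only
  std::vector<byte_range> bloom_filters;     // filter columns, unpruned row groups only
};

// Host-only: walks the metadata and emits the byte ranges for the secondary
// filters; the caller performs the IO and passes the bytes to step 3.
secondary_filter_ranges get_secondary_filters(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_group_validity);

}  // namespace cudf::io::parquet::experimental
```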

3. Prune valid row groups using secondary filters

Function prune_row_groups_by_dictionary_pages (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-rowgroup validity BOOL8 columns
    • Compressed dictionary pages in pinned host buffer (libcudf chooses host or device decompression)
  • Outputs
    • Updated per-rowgroup validity BOOL8 columns

Function prune_row_groups_by_bloom_filters (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-rowgroup validity BOOL8 column
    • Bloom filters in pinned host buffer
  • Outputs
    • Updated per-rowgroup validity BOOL8 column

Function update_validity_with_column_index (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-rowgroup validity BOOL8 column
  • Outputs
    • Updated per-page validity BOOL8 columns
    • Updated per-row validity BOOL8 column
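
Signature sketches for the three step 3 functions, again using the hypothetical names from the earlier sketches. Whether validity is updated in place or returned as new columns is a design choice this proposal leaves open; the sketch returns new columns.

```cpp
// Continuing the hypothetical sketch (same includes and namespace as above).
namespace cudf::io::parquet::experimental {

// Prunes row groups whose dictionary pages show the predicate cannot match.
std::unique_ptr<cudf::column> prune_row_groups_by_dictionary_pages(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_group_validity,
  cudf::host_span<uint8_t const> dictionary_page_data,  // compressed, pinned host buffer
  rmm::cuda_stream_view stream);

// Prunes row groups whose bloom filters rule out the predicate's literals.
std::unique_ptr<cudf::column> prune_row_groups_by_bloom_filters(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_group_validity,
  cudf::host_span<uint8_t const> bloom_filter_data,  // pinned host buffer
  rmm::cuda_stream_view stream);

// Expands row-group validity into per-page and per-row validity using the
// column index data held in the metadata.
struct expanded_validity {
  std::unique_ptr<cudf::column> per_page;  // BOOL8, one element per data page
  std::unique_ptr<cudf::column> per_row;   // BOOL8, one element per row
};

expanded_validity update_validity_with_column_index(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_group_validity,
  rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```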

4. Request filter column data pages

Function get_filter_columns_data_pages (no GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-page validity BOOL8 columns
    • Per-row validity BOOL8 column
  • Outputs
    • Byte ranges for filter column data pages (only unpruned pages from filter columns OR ranges for all column data pages; empty ranges for pruned columns)

Function materialize_filter_columns (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-page validity BOOL8 columns
    • Per-row validity BOOL8 column
    • Compressed data pages in pinned host buffer (libcudf chooses host or device decompression)
  • Outputs
    • Table containing Filter columns and null elsewhere

Function apply_filter_predicate (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-row validity BOOL8 column
    • Table containing Filter columns
  • Outputs
    • Per-row validity BOOL8 columns
    • Table containing Filter columns
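
Sketches for step 4, with the same caveats: `materialize_filter_columns` decompresses and decodes only the surviving filter column pages, and `apply_filter_predicate` then evaluates the predicate row-wise to refine the per-row validity.

```cpp
// Continuing the hypothetical sketch (same includes and namespace as above).
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>

namespace cudf::io::parquet::experimental {

// Host-only: byte ranges for the filter column data pages that survived
// pruning (with empty ranges standing in for pruned pages).
std::vector<byte_range> get_filter_columns_data_pages(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& page_validity,
  cudf::column_view const& row_validity);

// Decodes the filter columns; non-filter columns come back as all-null.
std::unique_ptr<cudf::table> materialize_filter_columns(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& page_validity,
  cudf::column_view const& row_validity,
  cudf::host_span<uint8_t const> filter_page_data,  // compressed, pinned host buffer
  rmm::cuda_stream_view stream);

// Evaluates the predicate row-wise; returns the refined per-row validity
// along with the filter column table.
std::pair<std::unique_ptr<cudf::column>, std::unique_ptr<cudf::table>>
apply_filter_predicate(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_validity,
  cudf::table_view const& filter_columns,
  rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```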

5. Request payload column data pages

Function get_payload_columns_data_pages (no GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-row validity BOOL8 columns
  • Outputs
    • Byte ranges for payload column data pages (only pages with at least one valid row OR byte ranges for all pages; empty for pages with no valid row)

Function materialize_payload_columns (needs GPU):

  • Inputs
    • Parquet metadata
    • Parquet reader options
    • Per-row validity BOOL8 columns
    • Compressed data pages in pinned host buffer
  • Outputs
    • Table containing Payload columns and null elsewhere
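
And sketches for step 5, completing the flow. Because every call takes the metadata and validity state as explicit arguments, the caller can spill, chunk, or retry between any two steps, which is the statelessness requirement stated above.

```cpp
// Continuing the hypothetical sketch (same includes and namespace as above).
namespace cudf::io::parquet::experimental {

// Host-only: byte ranges for payload column pages containing at least one
// valid row (or ranges for all pages, with empty ranges for pruned pages).
std::vector<byte_range> get_payload_columns_data_pages(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_validity);

// Decodes the payload columns; filter columns come back as all-null.
std::unique_ptr<cudf::table> materialize_payload_columns(
  parquet_scan_metadata const& metadata,
  parquet_reader_options const& options,
  cudf::column_view const& row_validity,
  cudf::host_span<uint8_t const> payload_page_data,  // compressed, pinned host buffer
  rmm::cuda_stream_view stream);

}  // namespace cudf::io::parquet::experimental
```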

Additional context
Also see #17716 about refactoring the footer management in the "bulk reader" table scan (the default in Spark-RAPIDS as of 25.04).

GregoryKimball moved this to Story Issue in libcudf on Jan 31, 2025
mhaseeb123 self-assigned this on Feb 1, 2025
mhaseeb123 (Member) commented Feb 1, 2025

The Parquet metadata here would be a struct similar to reader::impl::impl with some extra stuff to allow a stateless reader.
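
For illustration, such a struct might look like the following; its contents are deliberately left as comments, since the real fields would mirror what the current reader caches internally and none of this is actual libcudf API.

```cpp
// Purely illustrative; the concrete contents are assumptions.
struct parquet_scan_metadata {
  // Thrift-decoded file footer: schema, row groups, column chunk metadata.
  // Column index and offset index entries, when present near the footer.
  // Plus whatever per-file state the current reader caches internally,
  // supplied explicitly here so that every call can stay stateless.
};
```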

winningsix (Contributor) commented

It's great to have fine-grained APIs supporting a two-phase Parquet read.

I have two questions around dictionary encoding:

  • For dictionary-encoded columns, do we have an option to control whether they are materialized or not? Late materialization could be helpful, especially for the long-string case.
  • Will we consider supporting a dictionary filter in the first filter pass?

rapids-bot pushed a commit that referenced this issue Feb 13, 2025
…l PQ reader (#17946)

Related to #17896

This PR refactors Parquet reader's predicate pushdown to separate out row group pruning with stats, reading bloom filters, and row group pruning with bloom filters. This allows reusing corresponding functionalities in the experimental PQ reader for highly selective queries (Hybrid scan) as needed.

Note that no code has been added or removed in this PR; it has only been moved around.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #17946