Open
Description
I've been researching how parquet-rs
implements predicate pushdown and would like to introduce a similar approach in orc-rust
.
Predicate Pushdown in parquet-rs
In parquet-rs
, predicate pushdown is handled using a selection mechanism, allowing efficient filtering of records before they are fully loaded into Arrow arrays.
Key Components:
-
Selection Mechanism (
VecDeque<RowSelector>
)
Used inParquetRecordBatchReader
to track which records should be read or skipped.let front = selection.pop_front().unwrap(); if front.skip { let skipped = match self.array_reader.skip_records(front.row_count) { Ok(skipped) => skipped, Err(e) => return Some(Err(e.into())), }; if skipped != front.row_count { return Some(Err(general_err!( "failed to skip rows, expected {}, got {}", front.row_count, skipped ) .into())); } continue; }
-
Batch Processing (
ArrayReader
)- Implements three key methods:
fn read_records(&mut self, batch_size: usize) -> Result<usize>; fn consume_batch(&mut self) -> Result<ArrayRef>; fn skip_records(&mut self, num_records: usize) -> Result<usize>;
- Allows efficient processing of batch data without unnecessary record loading.
- Implements three key methods:
-
Predicate Evaluation (
evaluate_predicate
)- How it works:
- Uses
array_reader
and an existingselection
to construct a temporaryParquetRecordBatchReader
. - Iterates through all batches, applying the predicate to filter records.
- Generates a refined
selection
that determines which records should be included. - A second pass over the Parquet file is performed using this refined
selection
, skipping filtered records.
- Uses
- Key advantage:
- Since Arrow’s batch structure is immutable, filtering is done via
selection
, avoiding expensive in-place modifications. - This mechanism can also be leveraged for lazy materialization.
- Since Arrow’s batch structure is immutable, filtering is done via
- How it works:
Code Example from ParquetRecordBatchReaderBuilder::build()
if let Some(filter) = filter.as_mut() {
for predicate in filter.predicates.iter_mut() {
if !selects_any(selection.as_ref()) {
break;
}
let array_reader =
build_array_reader(self.fields.as_deref(), predicate.projection(), &reader)?;
selection = Some(evaluate_predicate(
batch_size,
array_reader,
selection,
predicate.as_mut(),
)?);
}
}
Proposal for orc-rust
I propose implementing a similar predicate pushdown API in orc-rust
, which would:
- Introduce a selection mechanism to allow skipping records early in the ORC read pipeline.
- Implement an
ArrayReader
equivalent in ORC to handle batch-wise filtering, skipping, and reading. - Modify ORC reader API to accept predicate filters and process them in an iterative, selection-based manner.
- Ensure efficient two-pass filtering:
- First pass: Evaluate predicates on a temporary reader and compute
selection
. - Second pass: Re-read ORC file using
selection
, skipping unnecessary records.
- First pass: Evaluate predicates on a temporary reader and compute
Would love to get feedback on whether this approach makes sense for orc-rust
, or if there are ORC-specific constraints that require adjustments.
Metadata
Metadata
Assignees
Labels
No labels