I've been researching how parquet-rs implements predicate pushdown and would like to introduce a similar approach in orc-rust.
Predicate Pushdown in parquet-rs
In parquet-rs, predicate pushdown is handled using a selection mechanism, allowing efficient filtering of records before they are fully loaded into Arrow arrays.
Key Components:
Selection Mechanism (VecDeque<RowSelector>)
Used in ParquetRecordBatchReader to track which records should be read or skipped.
let front = selection.pop_front().unwrap();
if front.skip {
    let skipped = match self.array_reader.skip_records(front.row_count) {
        Ok(skipped) => skipped,
        Err(e) => return Some(Err(e.into())),
    };
    if skipped != front.row_count {
        return Some(Err(general_err!(
            "failed to skip rows, expected {}, got {}",
            front.row_count,
            skipped
        )
        .into()));
    }
    continue;
}
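To make the read/skip bookkeeping concrete, here is a minimal, self-contained sketch of the same idea. It is not the parquet-rs code: `RowSelector` is redefined locally as a simplified stand-in, and `next_batch` plus the in-memory `rows` slice are hypothetical names used only for illustration.

```rust
use std::collections::VecDeque;

/// Simplified stand-in for parquet's `RowSelector`: a run of rows to read or skip.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

impl RowSelector {
    fn select(row_count: usize) -> Self {
        Self { row_count, skip: false }
    }
    fn skip(row_count: usize) -> Self {
        Self { row_count, skip: true }
    }
}

/// Hypothetical helper: consumes selectors from the front of `selection` and
/// returns up to `batch_size` selected rows, advancing `cursor` through `rows`.
fn next_batch(
    rows: &[i64],
    cursor: &mut usize,
    selection: &mut VecDeque<RowSelector>,
    batch_size: usize,
) -> Vec<i64> {
    let mut out = Vec::with_capacity(batch_size);
    while out.len() < batch_size && *cursor < rows.len() {
        let Some(front) = selection.pop_front() else { break };
        if front.skip {
            // Skipped rows are never copied into the output batch.
            *cursor = (*cursor + front.row_count).min(rows.len());
            continue;
        }
        // Take only as many rows as the batch still has room for.
        let take = front.row_count.min(batch_size - out.len());
        let end = (*cursor + take).min(rows.len());
        out.extend_from_slice(&rows[*cursor..end]);
        *cursor = end;
        // Put the unread remainder of this run back for the next batch.
        if take < front.row_count {
            selection.push_front(RowSelector::select(front.row_count - take));
        }
    }
    out
}
```

The point of the run-length form is that a long filtered range costs one `skip` entry rather than one flag per row, and a run that straddles a batch boundary is simply split and pushed back.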
Batch Processing (ArrayReader)
Predicate Evaluation (evaluate_predicate)
… array_reader and an existing selection to construct a temporary ParquetRecordBatchReader.
… selection that determines which records should be included.
… selection, skipping filtered records.
… selection, avoiding expensive in-place modifications.
Code Example from ParquetRecordBatchReaderBuilder::build()
Proposal for orc-rust
I propose implementing a similar predicate pushdown API in orc-rust, which would:
… ArrayReader equivalent in ORC to handle batch-wise filtering, skipping, and reading.
… selection.
… selection, skipping unnecessary records.
Would love to get feedback on whether this approach makes sense for orc-rust, or if there are ORC-specific constraints that require adjustments.
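As a starting point for discussion, here is a rough sketch of what the ORC side could look like. Everything in it is an assumption rather than an existing orc-rust API: the `OrcArrayReader` trait, the in-memory `VecReader`, and the helpers `selection_from_mask` and `read_selected` are hypothetical names, and a real implementation would decode ORC streams instead of slicing a `Vec`.

```rust
/// Hypothetical run-length selector, mirroring the shape of parquet's `RowSelector`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

/// Hypothetical ORC-side counterpart of parquet's `ArrayReader`.
trait OrcArrayReader {
    /// Decodes up to `n` rows and returns them.
    fn read_records(&mut self, n: usize) -> Vec<i64>;
    /// Advances past up to `n` rows without decoding; returns how many were skipped.
    fn skip_records(&mut self, n: usize) -> usize;
}

/// In-memory stand-in for a column, for illustration only.
struct VecReader {
    values: Vec<i64>,
    pos: usize,
}

impl OrcArrayReader for VecReader {
    fn read_records(&mut self, n: usize) -> Vec<i64> {
        let end = (self.pos + n).min(self.values.len());
        let out = self.values[self.pos..end].to_vec();
        self.pos = end;
        out
    }

    fn skip_records(&mut self, n: usize) -> usize {
        let end = (self.pos + n).min(self.values.len());
        let skipped = end - self.pos;
        self.pos = end;
        skipped
    }
}

/// Collapses a per-row predicate result into runs of select/skip.
fn selection_from_mask(mask: &[bool]) -> Vec<RowSelector> {
    let mut selection: Vec<RowSelector> = Vec::new();
    for &keep in mask {
        let skip = !keep;
        if let Some(last) = selection.last_mut() {
            if last.skip == skip {
                // Same kind as the previous row: extend the current run.
                last.row_count += 1;
                continue;
            }
        }
        selection.push(RowSelector { row_count: 1, skip });
    }
    selection
}

/// Reads only the selected rows; reports an error if a skip comes up short.
fn read_selected<R: OrcArrayReader>(
    reader: &mut R,
    selection: &[RowSelector],
) -> Result<Vec<i64>, String> {
    let mut out = Vec::new();
    for s in selection {
        if s.skip {
            let skipped = reader.skip_records(s.row_count);
            if skipped != s.row_count {
                return Err(format!(
                    "failed to skip rows, expected {}, got {}",
                    s.row_count, skipped
                ));
            }
        } else {
            out.extend(reader.read_records(s.row_count));
        }
    }
    Ok(out)
}
```

The intended flow would be: evaluate the predicate on the filter column to get a mask, turn it into a selection with `selection_from_mask`, then pass that selection to `read_selected` for the remaining columns so filtered rows are skipped instead of decoded.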