Implement Predicate Filter and Pushdown API in orc-rust Similar to parquet-rs

I've been researching how `parquet-rs` implements predicate pushdown and would like to introduce a similar approach in `orc-rust`.  

### Predicate Pushdown in `parquet-rs`  

In `parquet-rs`, predicate pushdown is handled using a **selection mechanism**, allowing efficient filtering of records before they are fully loaded into Arrow arrays.  

#### Key Components:  

1. **Selection Mechanism (`VecDeque<RowSelector>`)**  
   Used in `ParquetRecordBatchReader` to track which records should be read or skipped.
   ```rust
    let front = selection.pop_front().unwrap();
    if front.skip {
        let skipped = match self.array_reader.skip_records(front.row_count) {
            Ok(skipped) => skipped,
            Err(e) => return Some(Err(e.into())),
        };
    
        if skipped != front.row_count {
            return Some(Err(general_err!(
                "failed to skip rows, expected {}, got {}",
                front.row_count,
                skipped
            )
            .into()));
        }
        continue;
    }
   ```  

2. **Batch Processing (`ArrayReader`)**  
   - Implements three key methods:  
     ```rust
     fn read_records(&mut self, batch_size: usize) -> Result<usize>;
     fn consume_batch(&mut self) -> Result<ArrayRef>;
     fn skip_records(&mut self, num_records: usize) -> Result<usize>;
     ```  
   - Allows efficient processing of batch data without unnecessary record loading.  

3. **Predicate Evaluation (`evaluate_predicate`)**  
   - **How it works:**  
     - Uses `array_reader` and an existing `selection` to construct a **temporary** `ParquetRecordBatchReader`.  
     - Iterates through **all batches**, applying the predicate to filter records.  
     - Generates a refined `selection` that determines which records should be included.  
     - A second pass over the Parquet file is performed using this refined `selection`, **skipping filtered records**.  
   - **Key advantage:**  
     - Since Arrow’s batch structure is immutable, filtering is done via `selection`, avoiding expensive in-place modifications.  
     - This mechanism can also be leveraged for **lazy materialization**.  

#### Code Example from `ParquetRecordBatchReaderBuilder::build()`  

```rust
if let Some(filter) = filter.as_mut() {
    for predicate in filter.predicates.iter_mut() {
        if !selects_any(selection.as_ref()) {
            break;
        }

        let array_reader =
            build_array_reader(self.fields.as_deref(), predicate.projection(), &reader)?;

        selection = Some(evaluate_predicate(
            batch_size,
            array_reader,
            selection,
            predicate.as_mut(),
        )?);
    }
}
```
### Proposal for `orc-rust`

I propose implementing a similar **predicate pushdown API** in `orc-rust`, which would:

1. **Introduce a selection mechanism** to allow skipping records early in the ORC read pipeline.  
2. **Implement an `ArrayReader` equivalent** in ORC to handle batch-wise filtering, skipping, and reading.  
3. **Modify ORC reader API** to accept predicate filters and process them in an **iterative, selection-based manner**.  
4. **Ensure efficient two-pass filtering**:  
   - **First pass:** Evaluate predicates on a temporary reader and compute `selection`.  
   - **Second pass:** Re-read ORC file using `selection`, skipping unnecessary records.  

Would love to get feedback on whether this approach makes sense for `orc-rust`, or if there are ORC-specific constraints that require adjustments.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement Predicate Filter and Pushdown API in orc-rust Similar to parquet-rs #40

Predicate Pushdown in `parquet-rs`

Key Components:

Code Example from `ParquetRecordBatchReaderBuilder::build()`

Proposal for `orc-rust`

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Implement Predicate Filter and Pushdown API in orc-rust Similar to parquet-rs #40

Description

Predicate Pushdown in parquet-rs

Key Components:

Code Example from ParquetRecordBatchReaderBuilder::build()

Proposal for orc-rust

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Predicate Pushdown in `parquet-rs`

Code Example from `ParquetRecordBatchReaderBuilder::build()`

Proposal for `orc-rust`