@jecsand838 jecsand838 commented Aug 25, 2025

Which issue does this PR close?

Rationale for this change

When reading Avro into Arrow with a projection or a reader schema that omits some writer fields, we were still decoding those writer‑only fields item‑by‑item. This is unnecessary work and can dominate CPU time for large arrays/maps or deeply nested records.

Avro’s binary format explicitly allows fast skipping for arrays/maps by encoding data in blocks: when the count is negative, the next long gives the byte size of the block, enabling O(1) skipping of that block without decoding each item. This PR teaches the record reader to recognize and leverage that, and to avoid constructing decoders for fields we will skip altogether.
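As an illustration of the block framing described above, here is a minimal, self-contained sketch of block-wise skipping. It is not the code in this PR: `read_long`, `skip_long`, and `skip_blocks` are hypothetical names, and the buffer/position interface is an assumption chosen to keep the example small.

```rust
use std::convert::TryFrom;

/// Reads one zigzag-encoded, variable-length Avro `long` starting at `*pos`.
fn read_long(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let mut value: u64 = 0;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None; // more than 10 bytes: malformed varint
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Undo zigzag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    Some(((value >> 1) as i64) ^ -((value & 1) as i64))
}

/// Skips a single `long` item (used as the per-item fallback below).
fn skip_long(buf: &[u8], pos: &mut usize) -> Option<()> {
    read_long(buf, pos).map(|_| ())
}

/// Skips a whole Avro array or map. Blocks with a negative count carry their
/// byte size, so they are skipped with one position bump; blocks with a
/// positive count fall back to skipping item by item via `skip_item`.
fn skip_blocks(
    buf: &[u8],
    pos: &mut usize,
    skip_item: fn(&[u8], &mut usize) -> Option<()>,
) -> Option<()> {
    loop {
        let count = read_long(buf, pos)?;
        if count == 0 {
            return Some(()); // a zero count terminates the array/map
        }
        if count < 0 {
            let size = usize::try_from(read_long(buf, pos)?).ok()?;
            let end = pos.checked_add(size)?;
            if end > buf.len() {
                return None; // block claims more bytes than are available
            }
            *pos = end;
        } else {
            for _ in 0..count {
                skip_item(buf, pos)?;
            }
        }
    }
}
```

For a map, the same framing applies; `skip_item` would skip a string key followed by a value.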

What changes are included in this PR?

Reader / decoding architecture

  • Skip-aware record decoding:
    • At construction time, we now precompute per-record skip decoders for writer fields that the reader will ignore.
    • Introduced a resolved-record path (RecordResolved) that carries:
      • writer_to_reader mapping for field alignment,
      • a prebuilt list of skip decoders for fields not present in the reader,
      • the set of active per-field decoders for the projected fields.
  • Codec builder enhancements: In arrow-avro/src/codec.rs, record construction now:
    • Builds Arrow Fields and their decoders only for fields that are read,
    • Builds skip_decoders (via build_skip_decoders) for fields to ignore.
  • Error handling and consistency: Kept existing strict-mode behavior; improved internal branching to avoid inconsistent states during partial decodes.
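The writer-to-reader alignment described in this list can be pictured with a small sketch. This is an illustration under simplifying assumptions, not the `RecordResolved` implementation: `FieldPlan`, `plan_fields`, and `decode_row` are hypothetical names, fields are matched by name only, and every writer field is assumed to be an Avro `double` (8 bytes, little-endian).

```rust
use std::convert::TryFrom;

/// What to do with each writer field, in writer order (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldPlan {
    /// Decode into the reader column at this index.
    Decode(usize),
    /// Field is absent from the reader schema: advance past it.
    Skip,
}

/// Computes the plan once, at construction time, by matching field names.
fn plan_fields(writer: &[&str], reader: &[&str]) -> Vec<FieldPlan> {
    writer
        .iter()
        .map(|w| match reader.iter().position(|r| r == w) {
            Some(i) => FieldPlan::Decode(i),
            None => FieldPlan::Skip,
        })
        .collect()
}

/// Decodes one record whose writer fields are all doubles, appending only
/// the projected values and stepping over the rest.
fn decode_row(
    buf: &[u8],
    pos: &mut usize,
    plan: &[FieldPlan],
    columns: &mut [Vec<f64>],
) -> Option<()> {
    for step in plan {
        let end = pos.checked_add(8)?;
        let bytes = buf.get(*pos..end)?;
        if let FieldPlan::Decode(i) = step {
            let raw = <[u8; 8]>::try_from(bytes).ok()?;
            columns.get_mut(*i)?.push(f64::from_le_bytes(raw));
        }
        *pos = end; // skipped fields cost only this position bump
    }
    Some(())
}
```

Because the plan is built once per schema pair, the per-row loop does no name lookups.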

Tests

  • Unit tests (in arrow-avro/src/reader/record.rs)
    • Added focused tests that exercise the new skip logic:
      • Skipping writer‑only fields inside arrays and maps (including negative‑count block skipping and mixed multi‑block payloads).
      • Skipping nested structures within records to ensure offsets and lengths remain correct for the fields that are read.
      • Ensured nullability and union handling remain correct when adjacent fields are skipped.
  • Integration tests (in arrow-avro/src/reader/mod.rs)
    • Added an end‑to‑end test using avro/alltypes_plain.avro to validate that projecting a subset of fields (reader schema omits some writer fields) both:
      • Produces the correct Arrow arrays for the selected fields, and
      • Avoids decoding skipped fields (validated indirectly via behavior and block boundaries).
    • The test covers compressed and uncompressed variants already present in the suite to ensure behavior is consistent across codecs.

Are these changes tested?

  • New unit tests cover:
    • Fast skipping for arrays/maps using negative block counts and block sizes (per Avro spec).
    • Nested and nullable scenarios to ensure correct offsets, validity bitmaps, and flush behavior when adjacent fields are skipped.
  • New integration test in reader/mod.rs:
    • Reads avro/alltypes_plain.avro with a reader schema that omits several writer fields and asserts the resulting RecordBatch matches the expected arrays while exercising the skip path.
  • Existing promotion, enum, decimal, fixed, and union tests continue to pass, ensuring no regressions in unrelated areas.

Are there any user-facing changes?

N/A since arrow-avro is not public yet.

@github-actions github-actions bot added the arrow (Changes to the arrow crate) and arrow-avro (arrow-avro crate) labels Aug 25, 2025
…tion.

- Added skipping logic for writer-only fields in `RecordDecoder`.
- Introduced `ResolvedRuntime` for runtime decoding adjustments.
- Updated tests to validate skipping functionality.
- Refactored block-wise processing for optimized performance.
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-skip-values branch from f7fd11b to 9f35502 Compare August 25, 2025 22:43
@alamb
Contributor

alamb commented Aug 26, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing avro-schema-resolution-skip-values (9f35502) to a620957 diff
BENCH_NAME=avro_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench avro_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=avro-schema-resolution-skip-values
Results will be posted here when complete

@alamb
Contributor

alamb commented Aug 26, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing avro-schema-resolution-skip-values (9f35502) to a620957 diff
BENCH_NAME=avro_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench avro_reader
BENCH_FILTER=
BENCH_BRANCH_NAME=avro-schema-resolution-skip-values
Results will be posted here when complete

@alamb
Contributor

alamb commented Aug 26, 2025

🤖: Benchmark completed

Details

group                                              avro-schema-resolution-skip-values     main
-----                                              ----------------------------------     ----
array_creation/string_array_1000_chars             1.00     58.6±0.31µs        ? ?/sec    1.26     74.0±0.12µs        ? ?/sec
array_creation/string_array_100_chars              1.01      9.5±0.01µs        ? ?/sec    1.00      9.3±0.04µs        ? ?/sec
array_creation/string_array_10_chars               1.01      6.5±0.01µs        ? ?/sec    1.00      6.5±0.01µs        ? ?/sec
array_creation/string_view_1000_chars              1.00     67.5±0.79µs        ? ?/sec    1.05     70.6±3.12µs        ? ?/sec
array_creation/string_view_100_chars               1.01     10.8±0.01µs        ? ?/sec    1.00     10.7±0.10µs        ? ?/sec
array_creation/string_view_10_chars                1.00      7.8±0.01µs        ? ?/sec    1.00      7.8±0.01µs        ? ?/sec
avro_reader/string_array_1000_chars                1.00    379.4±4.28µs        ? ?/sec    1.01    384.3±3.62µs        ? ?/sec
avro_reader/string_array_100_chars                 1.00     81.9±0.09µs        ? ?/sec    1.01     82.5±0.17µs        ? ?/sec
avro_reader/string_array_10_chars                  1.00     61.6±0.09µs        ? ?/sec    1.00     61.8±0.13µs        ? ?/sec
avro_reader/string_view_1000_chars                 1.00    341.0±2.53µs        ? ?/sec    1.04    356.1±3.05µs        ? ?/sec
avro_reader/string_view_100_chars                  1.00     83.9±0.12µs        ? ?/sec    1.00     84.3±0.42µs        ? ?/sec
avro_reader/string_view_10_chars                   1.00     63.0±0.06µs        ? ?/sec    1.00     63.1±0.11µs        ? ?/sec
string_operations/string_array_value_1000_chars    1.01    246.0±0.11ns        ? ?/sec    1.00    242.9±1.07ns        ? ?/sec
string_operations/string_array_value_100_chars     1.00    245.7±0.13ns        ? ?/sec    1.00    246.1±0.17ns        ? ?/sec
string_operations/string_array_value_10_chars      1.00    244.8±0.15ns        ? ?/sec    1.01    246.4±0.22ns        ? ?/sec
string_operations/string_view_value_1000_chars     1.00   1072.3±1.25ns        ? ?/sec    1.00   1076.0±5.81ns        ? ?/sec
string_operations/string_view_value_100_chars      1.00   1071.6±0.49ns        ? ?/sec    1.00   1072.7±1.85ns        ? ?/sec
string_operations/string_view_value_10_chars       1.00   1072.2±1.16ns        ? ?/sec    1.00   1073.3±0.57ns        ? ?/sec

Contributor

@alamb alamb left a comment

Thank you @jecsand838 -- the code looks good to me. I am a little worried about the lack of test coverage -- is there any chance you can add coverage for skipping more of the types?

I think "end to end" type skipping tests would be the best. Maybe something like

  1. Write a file with all supported avro types
  2. Read each (single) column back (skipping all the others)
  3. Verify the output column is the same as was written.
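The three steps above could be organized along these lines. This is only a shape for such a test, not arrow-avro API: `mismatched_columns` is a hypothetical helper, and `read_projected` stands in for whatever reads the file with a reader schema containing a single field.

```rust
/// For every written column, reads it back alone (all other columns are
/// skipped by the reader) and collects the names of columns whose values
/// differ from what was written. An empty result means the check passed.
fn mismatched_columns<C: PartialEq>(
    written: &[(&str, C)],
    read_projected: impl Fn(&str) -> Option<C>,
) -> Vec<String> {
    let mut bad = Vec::new();
    for (name, expected) in written {
        // `None` models a projection that failed or returned no column.
        if read_projected(*name).as_ref() != Some(expected) {
            bad.push(name.to_string());
        }
    }
    bad
}
```

Returning the list of failing column names, rather than asserting inside the loop, reports every type whose skip path is broken in a single run.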

@@ -1537,6 +1564,57 @@ mod test {
assert!(batch.column(0).as_any().is::<StringViewArray>());
}

#[test]
fn test_alltypes_skip_writer_fields_keep_double_only() {
let file = arrow_test_data("avro/alltypes_plain.avro");
Contributor

Contributor Author

100% They are solid files.

@@ -736,6 +858,166 @@ fn sign_extend_to<const N: usize>(raw: &[u8]) -> Result<[u8; N], ArrowError> {
Ok(arr)
}

/// Lightweight skipping decoder for writer-only fields
Contributor

I found the term "writer only" field somewhat confusing -- I think the same concept (not decoding fields into arrow that are not requested) is called "non-projected fields" in the parquet, json, and csv readers.

I think the name skipper is quite clear; this is just a high-level comment about the terminology in the comments (I know, 🙄 )

Contributor Author

That's a good callout! I can definitely see where the confusion stems from.

I just updated the comment to read like this:

/// Lightweight skipper for non‑projected writer fields
/// (fields present in the writer schema but omitted by the reader/projection);
/// per Avro 1.11.1 schema resolution these fields are ignored.
///
/// <https://avro.apache.org/docs/1.11.1/specification/#schema-resolution>

Let me know if that's more clear. I fully agree that comments / documentation need to be straightforward and consistent in terminology and language across the project.
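To make the idea of a skipper for non-projected fields concrete, here is a rough sketch of how skipping can follow the Avro binary encoding of each type. It is not the `Skipper` in this PR: `FieldSkipper`, `advance`, and `read_long` are hypothetical names, only a few types are covered, and nullable unions are assumed to be written as `["null", T]` (branch 0 is null).

```rust
use std::convert::TryFrom;

/// Reads one zigzag-encoded, variable-length Avro `long`.
fn read_long(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let (mut value, mut shift) = (0u64, 0u32);
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None;
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(((value >> 1) as i64) ^ -((value & 1) as i64));
        }
        shift += 7;
    }
}

/// Moves `pos` forward by `n` bytes, failing if the buffer is too short.
fn advance(buf: &[u8], pos: &mut usize, n: usize) -> Option<()> {
    let end = pos.checked_add(n)?;
    if end > buf.len() {
        return None;
    }
    *pos = end;
    Some(())
}

/// Illustrative skipper: one variant per encoding shape.
#[allow(dead_code)] // a small demo need not construct every variant
enum FieldSkipper {
    Null,                        // zero bytes
    Boolean,                     // one byte
    Long,                        // int and long share the varint encoding
    Float,                       // 4 bytes
    Double,                      // 8 bytes
    Bytes,                       // bytes and string: length prefix + payload
    Fixed(usize),                // exactly N bytes
    Record(Vec<FieldSkipper>),   // fields in writer order
    Nullable(Box<FieldSkipper>), // assumed ["null", T] union
}

impl FieldSkipper {
    fn skip(&self, buf: &[u8], pos: &mut usize) -> Option<()> {
        match self {
            Self::Null => Some(()),
            Self::Boolean => advance(buf, pos, 1),
            Self::Long => read_long(buf, pos).map(|_| ()),
            Self::Float => advance(buf, pos, 4),
            Self::Double => advance(buf, pos, 8),
            Self::Bytes => {
                let len = usize::try_from(read_long(buf, pos)?).ok()?;
                advance(buf, pos, len)
            }
            Self::Fixed(n) => advance(buf, pos, *n),
            Self::Record(fields) => {
                for field in fields {
                    field.skip(buf, pos)?;
                }
                Some(())
            }
            Self::Nullable(inner) => match read_long(buf, pos)? {
                0 => Some(()),
                1 => inner.skip(buf, pos),
                _ => None, // branch index outside the two-branch union
            },
        }
    }
}
```

None of the variants allocate or build Arrow arrays; each only advances the read position.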

@@ -1471,4 +1753,196 @@ mod tests {
assert!(int_array.is_null(0)); // row1 is null
assert_eq!(int_array.value(1), 42); // row3 value is 42
}

fn make_record_resolved_decoder(
Contributor

I didn't fully follow these tests, but I didn't find any coverage for skipping the nested types (Lists, Maps, Structs).

I ran llvm-cov to double check and it seems to imply this code isn't tested:

cargo llvm-cov --html -p arrow-avro

Report is here: coverage.zip

For example
coverage/Users/andrewlamb/Software/arrow-rs/arrow-avro/src/reader/record.rs.html

Screenshot 2025-08-26 at 7 05 43 AM

Contributor Author

@jecsand838 jecsand838 Aug 26, 2025

@alamb I'm planning to include the Maker::resolve_type method branches for the complex and logical types over my next few PRs (for example, the enum mapping PR I put up has support for the Enum branch). These branches just have additional logic and I didn't want to balloon this PR.

That being said, I probably should have included placeholders, and in turn tests able to reach the Skipper::from_avro method, as part of this PR. That's definitely my mistake.

So I went ahead and added those placeholder branches and created an Avro file that covers every type currently supported by arrow-avro using this Python script: https://gist.github.com/jecsand838/82d9874a5f9be8a636dcd49ad9b8e237

Then I added a new test_skippable_types_project_each_field_individually test to the arrow-avro/src/reader/mod.rs file. This test behaves as you recommended in your other comment. Once the arrow-avro Writer has full type support, we can move towards a round-trip approach as well. However, the changes I just pushed up should now include coverage for skipping each of those types.

Thank you for catching this and calling it out!

@jecsand838
Contributor Author

> Thank you @jecsand838 -- the code looks good to me. I am a little worried about the lack of test coverage -- is there any chance you can add coverage for skipping more of the types?
>
> I think "end to end" type skipping tests would be the best. Maybe something like
>
>   1. Write a file with all supported avro types
>   2. Read each (single) column back (skipping all the others)
>   3. Verify the output column is the same as was written.

That's a good callout! I can definitely do that.

@jecsand838
Contributor Author

@alamb I appreciate the solid review! I went ahead and pushed up changes that should address your feedback. Let me know what you think when you get a second.

Also, I created a PR in the arrow-testing project for the new arrow-avro/test/data/skippable_types.avro file I created: apache/arrow-testing#111

My general plan for these test files is to move them out of arrow-avro/test/data as they get accepted in arrow-testing.

@jecsand838 jecsand838 requested a review from alamb August 26, 2025 21:53
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-skip-values branch from 66d6163 to 9423bf1 Compare August 27, 2025 16:59
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-skip-values branch 2 times, most recently from 91fb2c7 to 54cb130 Compare August 30, 2025 19:47