Added arrow-avro schema resolution value skipping #8220

jecsand838 · 2025-08-25T17:42:24Z

Which issue does this PR close?

Part of Add Avro Support #4886
Follows up on Added arrow-avro schema resolution foundations and type promotion #8047

Rationale for this change

When reading Avro into Arrow with a projection or a reader schema that omits some writer fields, we were still decoding those writer‑only fields item‑by‑item. This is unnecessary work and can dominate CPU time for large arrays/maps or deeply nested records.

Avro’s binary format explicitly allows fast skipping for arrays/maps by encoding data in blocks: when the count is negative, the next long gives the byte size of the block, enabling O(1) skipping of that block without decoding each item. This PR teaches the record reader to recognize and leverage that, and to avoid constructing decoders for fields we will skip altogether.
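To make the block-skipping idea concrete, here is a minimal sketch of it in isolation. It is not the code from this PR: the names `read_long` and `skip_blocks` and their signatures are assumptions chosen for illustration, and errors are reduced to `None`. It only relies on the Avro rules described above: longs are zigzag varints, a block with a negative count is followed by its byte size, and a zero count ends the array or map.

```rust
/// Decode one Avro `long` (zigzag varint) from the front of `buf`.
/// Returns the value and the number of bytes consumed.
/// (Hypothetical helper for this sketch, not the arrow-avro API.)
fn read_long(buf: &[u8]) -> Option<(i64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            // Undo the zigzag encoding.
            let decoded = ((value >> 1) as i64) ^ -((value & 1) as i64);
            return Some((decoded, i + 1));
        }
    }
    None // truncated or over-long varint
}

/// Skip a whole Avro array or map that starts at `buf[0]`.
/// `skip_item` is only called for blocks that do not carry a byte size;
/// it must return how many bytes one item occupies.
/// Returns the total number of bytes consumed, including the end marker.
fn skip_blocks(
    buf: &[u8],
    mut skip_item: impl FnMut(&[u8]) -> Option<usize>,
) -> Option<usize> {
    let mut pos = 0usize;
    loop {
        let (count, n) = read_long(buf.get(pos..)?)?;
        pos += n;
        if count == 0 {
            return Some(pos); // a zero count terminates the array/map
        }
        if count < 0 {
            // Negative count: the next long is the block's byte size,
            // so the whole block can be jumped over without decoding items.
            let (size, n) = read_long(buf.get(pos..)?)?;
            pos += n;
            pos = pos.checked_add(usize::try_from(size).ok()?)?;
            if pos > buf.len() {
                return None; // block claims more bytes than are available
            }
        } else {
            // Positive count: no size is available, skip item by item.
            for _ in 0..count {
                pos += skip_item(buf.get(pos..)?)?;
            }
        }
    }
}
```

For an array of ints, a caller could pass `|b| read_long(b).map(|(_, n)| n)` as `skip_item`. When the writer emitted sized blocks, that closure is never invoked.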

What changes are included in this PR?

Reader / decoding architecture

Skip-aware record decoding:
- At construction time, we now precompute per-record skip decoders for writer fields that the reader will ignore.
- Introduced a resolved-record path (RecordResolved) that carries:
  - writer_to_reader mapping for field alignment,
  - a prebuilt list of skip decoders for fields not present in the reader,
  - the set of active per-field decoders for the projected fields.
Codec builder enhancements: In arrow-avro/src/codec.rs, record construction now:
- Builds Arrow Fields and their decoders only for fields that are read,
- Builds skip_decoders (via build_skip_decoders) for fields to ignore.
Error handling and consistency: Kept existing strict-mode behavior; improved internal branching to avoid inconsistent states during partial decodes.
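The construction-time planning described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the `RecordResolved` implementation from this PR: `RecordPlan`, `WriterType`, and `Value` are hypothetical names, only `long` and `string` fields are modelled, fields are matched by exact name (no aliases or defaults), and output is one row of values rather than Arrow arrays.

```rust
#[derive(Clone, Copy)]
enum WriterType {
    Long,
    Str,
}

#[derive(Debug, PartialEq)]
enum Value {
    Long(i64),
    Str(String),
}

/// Built once per record schema pair; reused for every row.
struct RecordPlan {
    writer_fields: Vec<WriterType>,
    /// For each writer field: the reader column it feeds, or `None` to skip it.
    writer_to_reader: Vec<Option<usize>>,
    reader_len: usize,
}

/// Zigzag varint decode; returns (value, bytes consumed).
fn read_long(buf: &[u8]) -> Option<(i64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((((value >> 1) as i64) ^ -((value & 1) as i64), i + 1));
        }
    }
    None
}

impl RecordPlan {
    fn new(writer: &[(&str, WriterType)], reader: &[&str]) -> Self {
        RecordPlan {
            writer_fields: writer.iter().map(|(_, ty)| *ty).collect(),
            writer_to_reader: writer
                .iter()
                .map(|(name, _)| reader.iter().position(|r| r == name))
                .collect(),
            reader_len: reader.len(),
        }
    }

    /// Decode one row in writer order; returns the projected row and bytes consumed.
    fn decode_row(&self, buf: &[u8]) -> Option<(Vec<Option<Value>>, usize)> {
        let mut row: Vec<Option<Value>> = (0..self.reader_len).map(|_| None).collect();
        let mut pos = 0usize;
        for (ty, target) in self.writer_fields.iter().zip(&self.writer_to_reader) {
            let rest = buf.get(pos..)?;
            match (ty, target) {
                (WriterType::Long, Some(idx)) => {
                    let (v, n) = read_long(rest)?;
                    row[*idx] = Some(Value::Long(v));
                    pos += n;
                }
                // Skipped long: consume the varint, keep nothing.
                (WriterType::Long, None) => pos += read_long(rest)?.1,
                (WriterType::Str, Some(idx)) => {
                    let (len, n) = read_long(rest)?;
                    let end = n.checked_add(usize::try_from(len).ok()?)?;
                    let bytes = rest.get(n..end)?;
                    row[*idx] = Some(Value::Str(String::from_utf8(bytes.to_vec()).ok()?));
                    pos += end;
                }
                // Skipped string: read the length, jump over the payload
                // without copying or UTF-8 validation.
                (WriterType::Str, None) => {
                    let (len, n) = read_long(rest)?;
                    let end = n.checked_add(usize::try_from(len).ok()?)?;
                    rest.get(n..end)?; // bounds check only
                    pos += end;
                }
            }
        }
        Some((row, pos))
    }
}
```

The point of the design is that the read-or-skip decision is made once when the plan is built, so the per-row loop does no name lookups and allocates nothing for fields the reader omits.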

Tests

Unit tests (in arrow-avro/src/reader/record.rs)
- Added focused tests that exercise the new skip logic:
  - Skipping writer‑only fields inside arrays and maps (including negative‑count block skipping and mixed multi‑block payloads).
  - Skipping nested structures within records to ensure offsets and lengths remain correct for the fields that are read.
  - Ensured nullability and union handling remain correct when adjacent fields are skipped.
Integration tests (in arrow-avro/src/reader/mod.rs)
- Added end‑to‑end test using avro/alltypes_plain.avro to validate that projecting a subset of fields (reader schema omits some writer fields) both:
  - Produces the correct Arrow arrays for the selected fields, and
  - Avoids decoding skipped fields (validated indirectly via behavior and block boundaries).
- The test covers compressed and uncompressed variants already present in the suite to ensure behavior is consistent across codecs.

Are these changes tested?

New unit tests cover:
- Fast skipping for arrays/maps using negative block counts and block sizes (per Avro spec).
- Nested and nullable scenarios to ensure correct offsets, validity bitmaps, and flush behavior when adjacent fields are skipped.
New integration test in reader/mod.rs:
- Reads avro/alltypes_plain.avro with a reader schema that omits several writer fields and asserts the resulting RecordBatch matches the expected arrays while exercising the skip path.
Existing promotion, enum, decimal, fixed, and union tests continue to pass, ensuring no regressions in unrelated areas.

Are there any user-facing changes?

N/A since arrow-avro is not public yet.

…tion.
- Added skipping logic for writer-only fields in `RecordDecoder`.
- Introduced `ResolvedRuntime` for runtime decoding adjustments.
- Updated tests to validate skipping functionality.
- Refactored block-wise processing for optimized performance.

alamb · 2025-08-26T11:01:25Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing avro-schema-resolution-skip-values (9f35502) to a620957 diff
BENCH_NAME=avro_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench avro_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=avro-schema-resolution-skip-values
Results will be posted here when complete

alamb · 2025-08-26T11:02:51Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing avro-schema-resolution-skip-values (9f35502) to a620957 diff
BENCH_NAME=avro_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench avro_reader
BENCH_FILTER=
BENCH_BRANCH_NAME=avro-schema-resolution-skip-values
Results will be posted here when complete

alamb · 2025-08-26T11:08:44Z

🤖: Benchmark completed

Details

group                                              avro-schema-resolution-skip-values     main
-----                                              ----------------------------------     ----
array_creation/string_array_1000_chars             1.00     58.6±0.31µs        ? ?/sec    1.26     74.0±0.12µs        ? ?/sec
array_creation/string_array_100_chars              1.01      9.5±0.01µs        ? ?/sec    1.00      9.3±0.04µs        ? ?/sec
array_creation/string_array_10_chars               1.01      6.5±0.01µs        ? ?/sec    1.00      6.5±0.01µs        ? ?/sec
array_creation/string_view_1000_chars              1.00     67.5±0.79µs        ? ?/sec    1.05     70.6±3.12µs        ? ?/sec
array_creation/string_view_100_chars               1.01     10.8±0.01µs        ? ?/sec    1.00     10.7±0.10µs        ? ?/sec
array_creation/string_view_10_chars                1.00      7.8±0.01µs        ? ?/sec    1.00      7.8±0.01µs        ? ?/sec
avro_reader/string_array_1000_chars                1.00    379.4±4.28µs        ? ?/sec    1.01    384.3±3.62µs        ? ?/sec
avro_reader/string_array_100_chars                 1.00     81.9±0.09µs        ? ?/sec    1.01     82.5±0.17µs        ? ?/sec
avro_reader/string_array_10_chars                  1.00     61.6±0.09µs        ? ?/sec    1.00     61.8±0.13µs        ? ?/sec
avro_reader/string_view_1000_chars                 1.00    341.0±2.53µs        ? ?/sec    1.04    356.1±3.05µs        ? ?/sec
avro_reader/string_view_100_chars                  1.00     83.9±0.12µs        ? ?/sec    1.00     84.3±0.42µs        ? ?/sec
avro_reader/string_view_10_chars                   1.00     63.0±0.06µs        ? ?/sec    1.00     63.1±0.11µs        ? ?/sec
string_operations/string_array_value_1000_chars    1.01    246.0±0.11ns        ? ?/sec    1.00    242.9±1.07ns        ? ?/sec
string_operations/string_array_value_100_chars     1.00    245.7±0.13ns        ? ?/sec    1.00    246.1±0.17ns        ? ?/sec
string_operations/string_array_value_10_chars      1.00    244.8±0.15ns        ? ?/sec    1.01    246.4±0.22ns        ? ?/sec
string_operations/string_view_value_1000_chars     1.00   1072.3±1.25ns        ? ?/sec    1.00   1076.0±5.81ns        ? ?/sec
string_operations/string_view_value_100_chars      1.00   1071.6±0.49ns        ? ?/sec    1.00   1072.7±1.85ns        ? ?/sec
string_operations/string_view_value_10_chars       1.00   1072.2±1.16ns        ? ?/sec    1.00   1073.3±0.57ns        ? ?/sec

alamb

Thank you @jecsand838 -- the code looks good to me. I am a little worried about the lack of test coverage -- is there any chance you can add coverage for skipping more of the types?

I think "end to end" type skipping tests would be the best. Maybe something like

- Write a file with all supported avro types
- Read each (single) column back (skipping all the others)
- Verify the output column is the same as was written.

alamb · 2025-08-26T10:51:17Z

arrow-avro/src/reader/mod.rs

@@ -1537,6 +1564,57 @@ mod test {
        assert!(batch.column(0).as_any().is::<StringViewArray>());
    }

+    #[test]
+    fn test_alltypes_skip_writer_fields_keep_double_only() {
+        let file = arrow_test_data("avro/alltypes_plain.avro");


It is so cool to me to see the files added by @Igosuki in "Add basic AVRO files (translated copies of the parquet testing files to avro)" (arrow-testing#62) keep paying off and being used.

100% They are solid files.

alamb · 2025-08-26T10:55:52Z

arrow-avro/src/reader/record.rs

@@ -736,6 +858,166 @@ fn sign_extend_to<const N: usize>(raw: &[u8]) -> Result<[u8; N], ArrowError> {
    Ok(arr)
 }

+/// Lightweight skipping decoder for writer-only fields


I found the term "writer only" field somewhat confusing -- I think the same concept (not decoding fields into arrow that are not requested) is called "non-projected fields" in the parquet, json, and csv readers.

I think the name skipper is quite clear, this is just a high level comment about the terminology in the comments (I know, 🙄 )

That's a good callout! I can definitely see where the confusion stems from.

I just updated the comment to read like this:

/// Lightweight skipper for non‑projected writer fields
/// (fields present in the writer schema but omitted by the reader/projection);
/// per Avro 1.11.1 schema resolution these fields are ignored.
///
/// <https://avro.apache.org/docs/1.11.1/specification/#schema-resolution>

Let me know if that's more clear. I fully agree that comments / documentation need to be straightforward and consistent in terminology and language across the project.

alamb · 2025-08-26T11:07:57Z

arrow-avro/src/reader/record.rs

@@ -1471,4 +1753,196 @@ mod tests {
        assert!(int_array.is_null(0)); // row1 is null
        assert_eq!(int_array.value(1), 42); // row3 value is 42
    }
+
+    fn make_record_resolved_decoder(


I didn't fully follow these tests, but I didn't find any coverage for skipping the nested types (Lists, Maps, Structs).

I ran llvm-cov to double check and it seems to imply this code isn't tested:

cargo llvm-cov --html -p arrow-avro

Report is here: coverage.zip

For example
coverage/Users/andrewlamb/Software/arrow-rs/arrow-avro/src/reader/record.rs.html

@alamb I'm planning to include Maker::resolve_type method branches for the complex and logical types over my next few PRs (for example the enum mapping PR I put up has support for the Enum branch). These branches just have additional logic and I didn't want to balloon this PR.

That being said I probably should have included placeholders and then in turn tests able to reach the Skipper::from_avro method as a part of this PR. That's definitely my mistake.

So I went ahead and added those placeholder branches and created an Avro file that covers every type currently supported by arrow-avro using this python script: https://gist.github.com/jecsand838/82d9874a5f9be8a636dcd49ad9b8e237

Then I added a new test_skippable_types_project_each_field_individually test to the arrow-avro/src/reader/mod.rs file. This test behaves as you recommended in your other comment. Once the arrow-avro Writer has full type support, we can move towards a round trip approach as well. However the changes I just pushed up should include coverage for skipping each of those types now.

Thank you for catching this and calling it out!

jecsand838 · 2025-08-26T18:37:32Z

Thank you @jecsand838 -- the code looks good to me. I am a little worried about the lack of test coverage -- is there any chance you can add coverage for skipping more of the types?

I think "end to end" type skipping tests would be the best. Maybe something like

Write a file with all supported avro types

Read each (single) column back (skipping all the others)

Verify the output column is the same as was written.

That's a good callout! I can definitely do that.

jecsand838 · 2025-08-26T21:53:27Z

@alamb I appreciate the solid review! I went ahead and pushed up changes that should address your feedback. Let me know what you think when you get a second.

Also I created a PR in the arrow-testing project for the new arrow-avro/test/data/skippable_types.avro file I created: apache/arrow-testing#111

My general plan for these test files is to move them out of arrow-avro/test/data as they get accepted into arrow-testing.

github-actions bot added the arrow (Changes to the arrow crate) and arrow-avro (arrow-avro crate) labels Aug 25, 2025

jecsand838 force-pushed the avro-schema-resolution-skip-values branch from f7fd11b to 9f35502 Compare August 25, 2025 22:43

alamb reviewed Aug 26, 2025

Address PR Comments

1ed7efa

jecsand838 requested a review from alamb August 26, 2025 21:53

Merge branch 'main' into avro-schema-resolution-skip-values

9423bf1

jecsand838 force-pushed the avro-schema-resolution-skip-values branch from 66d6163 to 9423bf1 Compare August 27, 2025 16:59

jecsand838 mentioned this pull request Aug 30, 2025

[Avro] Decoder panics on flush when schema contains map whose value is non-nullable #8253

Open

jecsand838 force-pushed the avro-schema-resolution-skip-values branch 2 times, most recently from 91fb2c7 to 54cb130 Compare August 30, 2025 19:47

cleaned up skip_blocks method.

ebf4029

jecsand838 force-pushed the avro-schema-resolution-skip-values branch from 54cb130 to ebf4029 Compare August 30, 2025 19:50

jecsand838 mentioned this pull request Aug 30, 2025

[Avro] Support map with non-nullable value type #8254

Open
