Added arrow-avro schema resolution value skipping #8220
Conversation
…tion.
- Added skipping logic for writer-only fields in `RecordDecoder`.
- Introduced `ResolvedRuntime` for runtime decoding adjustments.
- Updated tests to validate skipping functionality.
- Refactored block-wise processing for optimized performance.
Force-pushed from f7fd11b to 9f35502
🤖: Benchmark completed
Thank you @jecsand838 -- the code looks good to me. I am a little worried about the lack of test coverage -- is there any chance you can add coverage for skipping more of the types?
I think "end to end" type skipping tests would be the best. Maybe something like
- Write a file with all supported avro types
- Read each (single) column back (skipping all the others)
- Verify the output column is the same as was written.
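A self-contained sketch of that round-trip pattern (std-only Rust with hand-rolled Avro-style encoding, not the real arrow-avro reader API; all names here are illustrative):

```rust
// Illustrative only: a hand-rolled mini "row" of Avro-style fields, used to
// show the project-one-field-and-skip-the-rest pattern. Zigzag varints and
// array blocks follow the Avro binary spec; the types are made up.

/// Zigzag-varint encode an Avro `long`.
fn write_long(v: i64, out: &mut Vec<u8>) {
    let mut n = ((v << 1) ^ (v >> 63)) as u64;
    while n >= 0x80 {
        out.push((n as u8 & 0x7F) | 0x80);
        n >>= 7;
    }
    out.push(n as u8);
}

/// Decode a zigzag-varint `long`, returning (value, bytes consumed).
fn read_long(buf: &[u8], pos: usize) -> (i64, usize) {
    let (mut acc, mut shift, mut i) = (0u64, 0u32, pos);
    loop {
        let b = buf[i];
        acc |= ((b & 0x7F) as u64) << shift;
        i += 1;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    (((acc >> 1) as i64) ^ -((acc & 1) as i64), i - pos)
}

#[derive(Clone, Debug, PartialEq)]
enum Val {
    Long(i64),
    Str(&'static str),
    DoubleArray(Vec<f64>),
}

fn write_field(f: &Val, out: &mut Vec<u8>) {
    match f {
        Val::Long(v) => write_long(*v, out),
        Val::Str(s) => {
            write_long(s.len() as i64, out);
            out.extend_from_slice(s.as_bytes());
        }
        // One array block with a NEGATIVE count followed by its byte size
        // (the form that permits O(1) skipping), then a 0 terminator.
        Val::DoubleArray(items) => {
            write_long(-(items.len() as i64), out);
            write_long((items.len() * 8) as i64, out);
            for d in items {
                out.extend_from_slice(&d.to_le_bytes());
            }
            write_long(0, out);
        }
    }
}

/// Advance past a field without materializing it. The array arm only handles
/// the negative-count block form written above.
fn skip_field(kind: &Val, buf: &[u8], mut pos: usize) -> usize {
    match kind {
        Val::Long(_) => pos + read_long(buf, pos).1,
        Val::Str(_) => {
            let (len, n) = read_long(buf, pos);
            pos + n + len as usize
        }
        Val::DoubleArray(_) => loop {
            let (count, n) = read_long(buf, pos);
            pos += n;
            if count == 0 {
                break pos;
            }
            // Negative count: the next long is the block's byte size.
            let (size, m) = read_long(buf, pos);
            pos += m + size as usize;
        },
    }
}

fn main() {
    let row = [
        Val::Long(7),
        Val::DoubleArray(vec![1.0, 2.0]),
        Val::Str("abc"),
    ];
    let mut buf = Vec::new();
    for f in &row {
        write_field(f, &mut buf);
    }
    // Project only the last field: skip everything before it.
    let mut pos = 0;
    for f in &row[..2] {
        pos = skip_field(f, &buf, pos);
    }
    let (len, n) = read_long(&buf, pos);
    let s = std::str::from_utf8(&buf[pos + n..pos + n + len as usize]).unwrap();
    assert_eq!(s, "abc");
    // Skipping every field consumes the row exactly.
    assert_eq!(skip_field(&row[2], &buf, pos), buf.len());
    println!("projected field round-tripped: {s}");
}
```

A real end-to-end test would write the file with all supported Avro types and drive projection through the arrow-avro reader instead.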
@@ -1537,6 +1564,57 @@ mod test {
    assert!(batch.column(0).as_any().is::<StringViewArray>());
}

#[test]
fn test_alltypes_skip_writer_fields_keep_double_only() {
    let file = arrow_test_data("avro/alltypes_plain.avro");
- It is so cool to me to see the files added by @Igosuki in Add basic AVRO files (translated copies of the parquet testing files to avro) arrow-testing#62 keep paying off / are used
100% They are solid files.
arrow-avro/src/reader/record.rs (Outdated)
@@ -736,6 +858,166 @@ fn sign_extend_to<const N: usize>(raw: &[u8]) -> Result<[u8; N], ArrowError> {
    Ok(arr)
}

/// Lightweight skipping decoder for writer-only fields
I found the term "writer only" field somewhat confusing -- I think the same concept (not decoding fields into arrow that are not requested) is called "non-projected fields" in the parquet, json, and csv readers.
I think the name `skipper` is quite clear; this is just a high level comment about the terminology in the comments (I know, 🙄)
That's a good callout! I can definitely see where the confusion stems from.
I just updated the comment to read like this:
/// Lightweight skipper for non‑projected writer fields
/// (fields present in the writer schema but omitted by the reader/projection);
/// per Avro 1.11.1 schema resolution these fields are ignored.
///
/// <https://avro.apache.org/docs/1.11.1/specification/#schema-resolution>
Let me know if that's more clear. I fully agree that comments / documentation need to be straightforward and consistent in terminology and language across the project.
@@ -1471,4 +1753,196 @@ mod tests {
    assert!(int_array.is_null(0)); // row1 is null
    assert_eq!(int_array.value(1), 42); // row3 value is 42
}

fn make_record_resolved_decoder(
I didn't fully follow these tests, but I didn't find any coverage for skipping the nested types (Lists, Maps, Structs).
I ran llvm-cov to double check and it seems to imply this code isn't tested:
cargo llvm-cov --html -p arrow-avro
Report is here: coverage.zip
For example:
coverage/Users/andrewlamb/Software/arrow-rs/arrow-avro/src/reader/record.rs.html
(screenshot of the uncovered code attached)
@alamb I'm planning to include `Maker::resolve_type` method branches for the complex and logical types over my next few PRs (for example, the enum mapping PR I put up has support for the `Enum` branch). These branches just have additional logic and I didn't want to balloon this PR.
That being said, I probably should have included placeholders and, in turn, tests able to reach the `Skipper::from_avro` method as part of this PR. That's definitely my mistake.
So I went ahead and added those placeholder branches and created an Avro file that covers every type currently supported by `arrow-avro` using this Python script: https://gist.github.com/jecsand838/82d9874a5f9be8a636dcd49ad9b8e237
Then I added a new `test_skippable_types_project_each_field_individually` test to the `arrow-avro/src/reader/mod.rs` file. This test behaves as you recommended in your other comment. Once the `arrow-avro` `Writer` has full type support, we can move towards a round-trip approach as well. However, the changes I just pushed up should include coverage for skipping each of those types now.
Thank you for catching this and calling it out!
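A rough, std-only sketch of what such dispatch-with-placeholder branches might look like (the `Skipper` variants, the `skipper_for` name, and the string tags are invented stand-ins; the real `Skipper::from_avro` works over arrow-avro's codec types):

```rust
// Invented stand-in for illustration: map a writer-side type tag to the
// skipper that knows how to pass over its bytes without decoding values.
#[derive(Debug, PartialEq)]
enum Skipper {
    Fixed(usize),       // boolean/float/double: known byte width
    Varint,             // int/long: zigzag varint
    LenPrefixed,        // bytes/string: varint length + payload
    List(Box<Skipper>), // array: block-encoded items
}

fn skipper_for(tag: &str) -> Option<Skipper> {
    Some(match tag {
        "boolean" => Skipper::Fixed(1),
        "int" | "long" => Skipper::Varint,
        "float" => Skipper::Fixed(4),
        "double" => Skipper::Fixed(8),
        "bytes" | "string" => Skipper::LenPrefixed,
        // "array<T>" recurses on the item type.
        t if t.starts_with("array<") && t.ends_with('>') => {
            Skipper::List(Box::new(skipper_for(&t[6..t.len() - 1])?))
        }
        // Placeholder: remaining branches (enum, map, record, ...) to be
        // filled in by later PRs, as discussed above.
        _ => return None,
    })
}

fn main() {
    assert_eq!(skipper_for("double"), Some(Skipper::Fixed(8)));
    assert_eq!(
        skipper_for("array<long>"),
        Some(Skipper::List(Box::new(Skipper::Varint)))
    );
    assert_eq!(skipper_for("enum"), None); // not yet wired up in this sketch
    println!("skipper dispatch ok");
}
```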
That's a good callout! I can definitely do that.
@alamb I appreciate the solid review! I went ahead and pushed up changes that should address your feedback. Let me know what you think when you get a second. Also I created a PR in the My general plan for these test files is to move them out of the
Force-pushed from 66d6163 to 9423bf1
Force-pushed from 91fb2c7 to 54cb130
Force-pushed from 54cb130 to ebf4029
Which issue does this PR close?

Rationale for this change

When reading Avro into Arrow with a projection or a reader schema that omits some writer fields, we were still decoding those writer-only fields item-by-item. This is unnecessary work and can dominate CPU time for large arrays/maps or deeply nested records.

Avro's binary format explicitly allows fast skipping for arrays/maps by encoding data in blocks: when the count is negative, the next `long` gives the byte size of the block, enabling O(1) skipping of that block without decoding each item. This PR teaches the record reader to recognize and leverage that, and to avoid constructing decoders for fields we will skip altogether.

What changes are included in this PR?
Reader / decoding architecture

- A resolved record decoder (`RecordResolved`) that carries a `writer_to_reader` mapping for field alignment.
- In `arrow-avro/src/codec.rs`, record construction now:
  - builds `Field`s and their decoders only for fields that are read,
  - builds `skip_decoders` (via `build_skip_decoders`) for fields to ignore.

Tests

- Unit tests in `arrow-avro/src/reader/record.rs`.
- Integration tests in `arrow-avro/src/reader/mod.rs` using `avro/alltypes_plain.avro` to validate that projecting a subset of fields (reader schema omits some writer fields) both decodes the projected columns correctly and exercises the skip path for the omitted ones.

Are these changes tested?
New test in `reader/mod.rs`: reads `avro/alltypes_plain.avro` with a reader schema that omits several writer fields and asserts the resulting `RecordBatch` matches the expected arrays while exercising the skip path.

Are there any user-facing changes?

N/A since `arrow-avro` is not public yet.
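The construction described under "Reader / decoding architecture" (real decoders only for projected fields, skip decoders for the rest) reduces to a per-field partition over the `writer_to_reader` mapping. A minimal sketch with made-up types (`Plan` and `plan_fields` are not arrow-avro names):

```rust
// Made-up types sketching the described construction: each writer field is
// either decoded into a reader column or handed to a cheap skipper.
#[derive(Debug, PartialEq)]
enum Plan {
    Decode { reader_idx: usize }, // projected: build a real decoder
    Skip,                         // non-projected: build a skip decoder
}

fn plan_fields(writer_to_reader: &[Option<usize>]) -> Vec<Plan> {
    writer_to_reader
        .iter()
        .map(|m| match m {
            Some(idx) => Plan::Decode { reader_idx: *idx },
            None => Plan::Skip,
        })
        .collect()
}

fn main() {
    // Writer schema has four fields; the reader projects only the 1st and 3rd.
    let mapping = [Some(0), None, Some(1), None];
    let plans = plan_fields(&mapping);
    assert_eq!(plans[0], Plan::Decode { reader_idx: 0 });
    assert_eq!(plans[1], Plan::Skip);
    println!("{plans:?}");
}
```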