-
Notifications
You must be signed in to change notification settings - Fork 1.1k
AvroError enum for arrow-avro crate #8759
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nathaniel-d-ef -- I think this makes sense to me, but it is a breaking API change and thus we will have to wait until the next major release in a few months:
FYI @jecsand838 and @mbrobbel
arrow-avro/src/errors.rs
Outdated
| EOF(String), | ||
| /// Arrow error. | ||
| /// Returned when reading into arrow or writing from arrow. | ||
| ArrowError(String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this instead keep the actual Arrow error rather than converting directly into a string?
| ArrowError(String), | |
| ArrowError(Box<ArrowError>)), |
I realize that is what ParquetError::ArrowError does the same thing -- but I think we might want to change that Parquet error as well
arrow-avro/src/errors.rs
Outdated
| } | ||
| } | ||
|
|
||
| impl From<cell::BorrowMutError> for AvroError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a fairly specific conversion -- maybe it would be simpler to annotate the locations where this happens with map_err(|e| AvroError::From(Box::new(e)) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, my mistake. This is unnecessary, I removed.
| } | ||
| return out | ||
| .write_all(&src_be[extra..]) | ||
| .map_err(|e| ArrowError::IoError(format!("write decimal fixed: {e}"), e)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change does appear to lose some error context. Would it be better to keep the information that this came from write decimal fixed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a good point. I adjusted this to use the General error and pass the contextual info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this addresses @alamb's concerns. Now e is string-ified and it losses the io::Error source. AFAIS only ::External and ::ArrowError preserve the source/cause (https://github.com/elastiflow/arrow-rs/blob/b40bb9c6b70ecefee8647e0c3dd9be77bc6ca2bf/arrow-avro/src/errors.rs#L54)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nathaniel-d-ef This doesn't seem to be fully addressed imo. Several write_all(...) failures are still converted to AvroError::General(format!(...)), which drops the io::Error source entirely, recreating the original concern. This shows up specifically in write_sign_extended (padding writes) and write_optional_index.
|
@nathaniel-d-ef I'll do a deeper review tonight, but if we plan to go this direction, should we also remove the I never got around to wiring that up before public release. But the original intent was to align CC: @alamb |
I see what you mean. I don't feel strongly either way. On the whole we're probably unlikely to use all of the variants here, so if it's preferable to roll this back a bit and implement the |
| } | ||
| return out | ||
| .write_all(&src_be[extra..]) | ||
| .map_err(|e| ArrowError::IoError(format!("write decimal fixed: {e}"), e)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this addresses @alamb's concerns. Now e is string-ified and it losses the io::Error source. AFAIS only ::External and ::ArrowError preserve the source/cause (https://github.com/elastiflow/arrow-rs/blob/b40bb9c6b70ecefee8647e0c3dd9be77bc6ca2bf/arrow-avro/src/errors.rs#L54)
|
main is now open for breaking API changes, if you would like to merge this one, we can do so. I can't remember where we landed |
Not sure it was 100% decided. My vote was to just use |
|
@jecsand838 I think the changes move things forward, so if you're good with them let's go ahead and merge. I should be able to clean up the conflicts in the next few days if no one beats me to it. I need to read up on what's changed; I've been out of the loop. |
|
Sounds good -- I'll plan to merge the PR once it is ready |
…ghout arrow-avro crate. # Conflicts: # arrow-avro/src/writer/encoder.rs
…ling across the arrow-avro crate
37b843b to
d8c43b0
Compare
jecsand838
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just reviewed your latest changes. Overall this is looking really good and I love the idea of the custom AvroError enum. The major callouts in this review relate to:
- Whether we can return
AvroErrorin the public api (likeparquet). - Fully addressing some previous comments.
Also it maybe a good idea to have @getChan take a look to ensure these changes work with the latest work in apache/datafusion#17861
arrow-avro/src/reader/header.rs
Outdated
| decoder.decode(b"Ob").unwrap(); | ||
| let err = decoder.decode(b"s").unwrap_err().to_string(); | ||
| assert_eq!(err, "Parser error: Incorrect avro magic"); | ||
| assert_eq!(err, "Parse error: Incorrect avro magic"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we needed to change the test expectation here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the avro error implementation changed the phrase slightly. I fixed it there.
arrow-avro/src/writer/encoder.rs
Outdated
| .map_err(|e| ArrowError::IoError(format!("write decimal fixed: {e}"), e))?; | ||
| .map_err(|e| AvroError::General(format!("write decimal fixed: {e}")))?; | ||
| rem -= pad.len(); | ||
| } | ||
| if rem > 0 { | ||
| out.write_all(&pad[..rem]) | ||
| .map_err(|e| ArrowError::IoError(format!("write decimal fixed: {e}"), e))?; | ||
| .map_err(|e| AvroError::General(format!("write decimal fixed: {e}")))?; | ||
| } | ||
| out.write_all(src_be) | ||
| .map_err(|e| ArrowError::IoError(format!("write decimal fixed: {e}"), e)) | ||
| .map_err(|e| AvroError::General(format!("write decimal fixed: {e}"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll want to preserve the error context in these map_err 's as well imo.
arrow-avro/src/reader/mod.rs
Outdated
| magic: &[u8; MAGIC_LEN], | ||
| fingerprint_from: impl FnOnce([u8; N]) -> Fingerprint, | ||
| ) -> Result<Option<usize>, ArrowError> { | ||
| ) -> Result<Option<usize>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This maybe just me, but not specifying the error enum being passed back makes the code a bit trickier for me to grok. Especially now that we have multiple error enums.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally fair, I'm fine leaning more on the explicit side, it doesn't hurt. I'll adjust.
arrow-avro/src/reader/mod.rs
Outdated
| let batch = self.flush_and_reset(); | ||
| self.apply_pending_schema(); | ||
| batch | ||
| batch.map_err(ArrowError::from) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually curious, why are we not following the pattern used by parquet and just return the pub AvroError? I thought that was the big reason for doing this PR and waiting for a major release?
I re-read the PR description and just caught this part, not sure I fully agree. I'm thinking we should either return AvroError or ArrowError::AvroError as part of the public API to align with the other crates and help downstream callers (if possible ofc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been a while since I kicked off this work, but I think it was a misinterpretation of the ask. I took another look at this and revised it.
arrow-avro/src/errors.rs
Outdated
| } | ||
|
|
||
| /// A specialized `Result` for Avro errors. | ||
| pub type Result<T, E = AvroError> = result::Result<T, E>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking we should remove this imo. From a reviewer perspective, the custom Result caused some confusion in correctly understanding the diff.
| pub type Result<T, E = AvroError> = result::Result<T, E>; |
arrow-avro/src/reader/cursor.rs
Outdated
| return Err(ArrowError::ParseError( | ||
| "Unexpected EOF reading float".to_string(), | ||
| )); | ||
| return Err(AvroError::EOF("Unexpected EOF reading float".to_string())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this was pre-existing, but maybe worth taking the chance to correct?
| return Err(AvroError::EOF("Unexpected EOF reading float".to_string())); | |
| return Err(AvroError::EOF("Unexpected EOF reading double".to_string())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good catch, yes
| #[derive(Debug)] | ||
| #[non_exhaustive] | ||
| pub enum AvroError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these errors are internal only, then why make this pub?
My preference would be to return pub if possible (like in the parquet crate), but if we can't shouldn't we make this pub(crate)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left this alone having changed the boundary type.
| pub mod codec; | ||
|
|
||
| /// AvroError variants | ||
| pub mod errors; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, if internal only, then should we make this change?
| pub mod errors; | |
| pub(crate) mod errors; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above
arrow-avro/src/reader/mod.rs
Outdated
| let writer_schema = hdr | ||
| .schema() | ||
| .map_err(|e| ArrowError::ExternalError(Box::new(e)))? | ||
| .map_err(|e| AvroError::External(Box::new(e)))? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because schema returns Result<Option<Schema<'_>>> // Result<_, AvroError> (the custom Result made this less readable for me), the closure |e| AvroError::External(Box::new(e)) is still wrapping an AvroError inside AvroError::External.
I think @martin-g called this out in another comment as well.
You can probably improve this just by something like this:
let writer_schema = hdr
.schema()?
.ok_or_else(|| AvroError::ParseError("No Avro schema present in file header".into()))?;
Which issue does this PR close?
Rationale for this change
The arrow-avro crate currently uses
ArrowErrorthroughout. This lacks the level of precision other crates in the project, such as Parquet, have.What changes are included in this PR?
Resultutility on the AvroError allows for Result<T, AvroError> to be written as Result.Are these changes tested?
No new functionality has been introduced, all existing tests are passing.
Are there any user-facing changes?
There shouldn't be - the API signatures remain the same.