AVRO-4039 [java] fix GenericData.newArray to only return an appropriate array implementation #3307
only return an appropriate array
Related issue: https://issues.apache.org/jira/browse/AVRO-4039
* different array implementation. By default, this returns a
* {@link GenericData.Array}.
*
* @param old the old array instance to reuse, if possible. If the old array
* is an appropriate type, it may be cleared and returned.
* @param size the size of the array to create.
* @param schema the schema of the array elements.
Note to reviewers:
I have changed the spec of this method, specifically to
- check the schema of the returned `GenericContainer` (if one is returned)

I am unsure what other `Collection`s may be passed. Would a `Set` be appropriate? What if the collection had different semantics? (This, however, is existing behaviour and is unchanged by this PR.)
The schema comparison uses `==`, not `.equals`, as I thought `.equals` would be potentially expensive and inappropriate in a performance optimisation.
I appreciate that this is very opinionated, and would welcome comments from developers who know Avro better than I do.
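To make the `==` versus `.equals` trade-off concrete, here is a minimal sketch. `FakeSchema` is a hypothetical stand-in written for this example, not Avro's `Schema` class, and the method names are illustrative only. The identity check is a single reference comparison but rejects a separately built schema with identical contents; the equality check accepts it at the cost of walking the contents.

```java
import java.util.List;
import java.util.Objects;

final class SchemaCheckDemo {
    // Hypothetical stand-in for a schema; NOT Avro's Schema class.
    static final class FakeSchema {
        final String name;
        final List<String> fields;

        FakeSchema(String name, List<String> fields) {
            this.name = name;
            this.fields = List.copyOf(fields);
        }

        // Structural equality: compares the whole field list,
        // so the cost grows with the size of the schema.
        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof FakeSchema)) return false;
            FakeSchema other = (FakeSchema) o;
            return name.equals(other.name) && fields.equals(other.fields);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, fields);
        }
    }

    // Identity check: constant time, true only for the very same instance.
    static boolean sameInstance(FakeSchema a, FakeSchema b) {
        return a == b;
    }

    // Equality check: also true for separately built but identical schemas.
    static boolean sameStructure(FakeSchema a, FakeSchema b) {
        return a.equals(b);
    }
}
```

With an identity check, a reusable container whose schema was parsed separately would simply not be reused, which is safe but may miss some reuse opportunities.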
Checking the schema of the returned container is a nice option, but I don't know if it's generally more expensive to do a full equality check as opposed to creating a new collection. I'd refrain from a change like this until we know (and then I fully agree with it).
As to other collection types: these should all implement the `Collection` interface anyway.
I was on holiday last week, so I didn't respond earlier.
@opwvhk The returned-schema check was just using `==` (identity check) rather than `.equals` (equality check), explicitly to avoid the cost.
Can you confirm whether this is OK, or whether the check should be removed?
My issue with the `Collection` interface is that reusing a `Set` will have different semantics from returning a new `List`.
I guess things sort of work OK at the moment, but I am always wary about changing code when I know so little about the expected behaviour.
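The `Set` versus `List` concern can be shown with plain `java.util` collections. This is a sketch only; `fill` is a hypothetical helper that mimics a reader clearing and refilling a reused collection, and it is not part of Avro.

```java
import java.util.Collection;
import java.util.List;

final class ReuseSemanticsDemo {
    // Clears the target and refills it, as a reader reusing a collection might.
    // Returns the number of elements the target ends up holding.
    static int fill(Collection<Object> target, List<?> values) {
        target.clear();
        target.addAll(values);
        return target.size();
    }
}
```

Filling an `ArrayList` with the values `1, 1, 2` keeps all three elements, whereas filling a `LinkedHashSet` with the same values keeps only two, so a reused `Set` silently drops duplicate array elements.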
@opwvhk I have removed the schema validity check as requested
if (schema.getElementType().getLogicalType() != null) {
return new GenericData.Array<Object>(size, schema);
}
Please note that a logical type `positiveInteger` is a perfectly valid logical type for an `int` value. So we cannot assume an `Object` here (even though it's the most likely option).
Can you please use `GenericData#getConversionFor(LogicalType)`, `Conversion#getConvertedType()` and `Boolean#TYPE` et al. to determine the correct type?
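One way to read this suggestion is sketched below. It deliberately avoids the Avro classes so that it stands alone: `ArrayKind` and `kindFor` are hypothetical names, `schemaType` stands for the Java type implied by the underlying schema type, and `convertedType` stands for whatever `Conversion#getConvertedType()` would report (or `null` when no conversion is registered). Treating boxed and primitive classes alike is an assumption of this sketch, not something the PR states.

```java
final class ElementTypeDemo {
    // Hypothetical array kinds; the names are illustrative only.
    enum ArrayKind { BOOLEAN, INT, LONG, FLOAT, DOUBLE, OBJECT }

    // Picks an array kind from the converted type when a conversion exists,
    // otherwise from the type implied by the underlying schema type.
    static ArrayKind kindFor(Class<?> schemaType, Class<?> convertedType) {
        Class<?> effective = convertedType != null ? convertedType : schemaType;
        if (effective == Boolean.TYPE || effective == Boolean.class) return ArrayKind.BOOLEAN;
        if (effective == Integer.TYPE || effective == Integer.class) return ArrayKind.INT;
        if (effective == Long.TYPE || effective == Long.class) return ArrayKind.LONG;
        if (effective == Float.TYPE || effective == Float.class) return ArrayKind.FLOAT;
        if (effective == Double.TYPE || effective == Double.class) return ArrayKind.DOUBLE;
        // Anything else (including an unknown type) needs a general object array.
        return ArrayKind.OBJECT;
    }
}
```

Under these assumptions, an `int` element with a logical type but no conversion, or with a conversion to `Integer`, can still use a primitive `int` array, while an `int` element converted to, say, `java.time.LocalDate` falls back to an object array.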
@opwvhk that shows how little I know about Avro.
I will have a look and rework.
@opwvhk reworked and added some tests
final var optimalValueType = PrimitivesArrays.optimalValueType(schema, logicalType,
conversion == null ? null : conversion.getConvertedType());
Nice implementation, but IMHO there are two issues with it:
- the determination of the logical type and conversion belongs inside the method, as they're not used elsewhere
- it's used only here, but located in the class `PrimitivesArrays`

Let's move the method `optimalValueType` here.
It's a bit of a tricky choice, as determining an array element type is coupled with both the schema and the arrays, but IMHO it belongs more with the former: reading data into a type also happens for maps and record properties.
I went to and fro on this a bit. Do you think we should add an enum for the `ArrayType`, in case other types should be generated (e.g. for `short`)?
It would be hard to open that up for custom types, though.
I did the rework as suggested
@mkeskells FYI, Arrow Java is now on a dedicated repository: https://github.com/apache/arrow-java
What is the purpose of the change
- Do not return a `PrimitiveArray` when it would not be appropriate
- If a `GenericContainer` is returned, its schema must match the supplied schema

"Appropriate" means that
- if the value is a `GenericContainer`, and thus has a schema, then the schema is the same

If we can't reuse the supplied value, then generate an appropriate collection, using the optimised values where we can
Updated the documentation, and added tests
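The reuse-or-create behaviour described above can be sketched with plain `java.util` collections. This is an illustration under stated assumptions rather than the code in this PR: `reuseOrCreate`, its `wanted` parameter and the `factory` supplier are hypothetical names, and the real method works with Avro's own array classes and schemas.

```java
import java.util.Collection;
import java.util.function.Supplier;

final class NewArrayDemo {
    // Reuses `old` only if it is an instance of the wanted implementation;
    // otherwise asks the factory for a fresh collection.
    static Collection<Object> reuseOrCreate(Object old, Class<?> wanted,
                                            Supplier<Collection<Object>> factory) {
        if (wanted.isInstance(old) && old instanceof Collection) {
            @SuppressWarnings("unchecked")
            Collection<Object> reused = (Collection<Object>) old;
            reused.clear(); // the caller refills a reused collection from scratch
            return reused;
        }
        return factory.get();
    }
}
```

Passing an existing `ArrayList` with `ArrayList.class` as the wanted type returns the same instance, emptied; passing a `HashSet` instead yields a new collection from the factory.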
Verifying this change
This change added tests and can be verified as follows:
Documentation