AVRO-4039 [java] fix GenericData.newArray to only return an appropriate array implementation #3307

Open
wants to merge 6 commits into
base: main

Conversation

mkeskells

What is the purpose of the change

  • Fix the class cast exceptions noted in the ticket (when using logical types)
  • Fix other paths that can return PrimitiveArray when it would not be appropriate
  • Tighten the constraints on the return value, so that if a GenericContainer is returned its schema must match the supplied schema

Here, "appropriate" means that:

  • If the supplied value can act as a container for the values that will be added, it is cleared and reused
  • If it is a GenericContainer, and thus has a schema, that schema must be the same as the supplied schema

If we can't reuse the supplied value, we generate an appropriate collection, using the optimised array implementations where we can
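
The reuse rule above can be sketched without any Avro dependency. This is only an illustration under stated assumptions, not the code in this PR: `ReuseSketch`, its `newArray` method, and the `canHold` flag are hypothetical names, and the real method decides from an Avro schema rather than a boolean.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class ReuseSketch {
    // Hypothetical stand-in for the decision made when an old value is supplied:
    // reuse it when it can hold the incoming values, otherwise allocate a new one.
    static Collection<Object> newArray(Object old, int size, boolean canHold) {
        if (canHold && old instanceof Collection) {
            @SuppressWarnings("unchecked")
            Collection<Object> reused = (Collection<Object>) old;
            reused.clear(); // drop stale values before handing the instance back
            return reused;
        }
        return new ArrayList<>(size);
    }

    public static void main(String[] args) {
        List<Object> old = new ArrayList<>(List.of("a", "b"));
        System.out.println(newArray(old, 4, true) == old);  // same instance, now empty
        System.out.println(newArray(old, 4, false) == old); // a fresh collection
    }
}
```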

Updated the documentation, and added tests

Verifying this change

This change added tests and can be verified as follows:

  • Added unit tests to ensure that appropriate values are returned (as described above)

Documentation

  • Does this pull request introduce a new feature? (no)

only return an appropriate array
@github-actions github-actions bot added the Java Pull Requests for Java binding label Feb 6, 2025
@mkeskells
Author

related issue - https://issues.apache.org/jira/browse/AVRO-4039

Mike Skells added 3 commits February 7, 2025 15:46
only return an appropriate array
fix import that spotless removed
Comment on lines 1520 to 1526
* different array implementation. By default, this returns a
* {@link GenericData.Array}.
*
* @param old the old array instance to reuse, if possible. If the old array
* is an appropriate type, it may be cleared and returned.
* @param size the size of the array to create.
* @param schema the schema of the array elements.
Author

Note to reviewers
I have changed the spec of this method, specifically to

  • check the schema of the returned GenericContainer (if one is returned)

I am unsure what other Collections may be passed. Would a Set be appropriate? What if the collection had different semantics? (This, however, is existing behaviour and unchanged by this PR.)

The schemas are compared via == rather than .equals, as I thought .equals would be potentially expensive and inappropriate in a performance optimisation.
I appreciate that this is very opinionated, and would welcome comments from developers who know Avro better than I do
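
To make the == versus .equals trade-off concrete, here is a small self-contained sketch. `FakeSchema` is a hypothetical stand-in, not Avro's `Schema` class; it only assumes that equality has to walk the schema's contents while identity is a single reference comparison.

```java
import java.util.List;
import java.util.Objects;

class SchemaCompareSketch {
    // Hypothetical stand-in for a schema: equals() compares every field name,
    // so its cost grows with the size of the schema.
    static final class FakeSchema {
        final List<String> fields;

        FakeSchema(List<String> fields) {
            this.fields = fields;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof FakeSchema && fields.equals(((FakeSchema) o).fields);
        }

        @Override
        public int hashCode() {
            return Objects.hash(fields);
        }
    }

    // Constant-time check: true only for the very same instance.
    static boolean sameInstance(FakeSchema a, FakeSchema b) {
        return a == b;
    }

    public static void main(String[] args) {
        FakeSchema a = new FakeSchema(List.of("id", "name"));
        FakeSchema b = new FakeSchema(List.of("id", "name"));
        System.out.println(sameInstance(a, a)); // identical reference
        System.out.println(sameInstance(a, b)); // equal content, different instance
        System.out.println(a.equals(b));        // structural comparison
    }
}
```

With an identity check, two structurally equal schemas that were created separately are treated as different, so the only cost of a false negative is a missed reuse.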

Contributor

Checking the schema of the returned container is a nice option, but I don't know if it's generally more expensive to do a full equality check as opposed to creating a new collection. I'd refrain from a change like this until we know (and then I fully agree with it).

As to other collection types: these should all implement the Collection interface anyway.

Author

Was on holiday last week, so didn't respond earlier.
@opwvhk The returned-schema check just used == (an identity check) rather than .equals (an equality check), explicitly to avoid that cost.
Can you confirm whether this is OK, or whether the check should be removed?

My issue with the Collection interface is that reusing a Set has different semantics from returning a new List.

I guess things sort of work OK at the moment, but I'm always wary about changing code when I know so little about the expected behaviour.
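
The Set concern can be shown with plain JDK collections. In this sketch `SetReuseSketch` and `refill` are hypothetical names; it assumes the reader refills a reused collection by clearing it and adding each decoded value.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

class SetReuseSketch {
    // Refill a collection the way a reader might: clear it, then add each value.
    static Collection<Integer> refill(Collection<Integer> target, List<Integer> values) {
        target.clear();
        target.addAll(values);
        return target;
    }

    public static void main(String[] args) {
        List<Integer> decoded = List.of(1, 1, 2);
        // A reused List keeps all three elements in order.
        System.out.println(refill(new ArrayList<>(), decoded));
        // A reused Set silently drops the duplicate and guarantees no order.
        System.out.println(refill(new HashSet<>(), decoded));
    }
}
```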

Author

@opwvhk I have removed the schema validity check as requested

Comment on lines 1541 to 1543
if (schema.getElementType().getLogicalType() != null) {
  return new GenericData.Array<Object>(size, schema);
}
Contributor

Please note that a logical type such as positiveInteger is a perfectly valid logical type for an int value. So we cannot assume an Object here (even though it's the most likely option).

Can you please use GenericData#getConversionFor(LogicalType), Conversion#getConvertedType() and Boolean#TYPE et al. to determine the correct type?
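
A rough sketch of that suggestion, without the Avro classes: the `convertedType` parameter stands in for whatever `Conversion#getConvertedType()` would return (or null when no conversion is registered), and `rawType` stands in for the Java type of the underlying primitive schema. `ElementTypeSketch` and both method names are hypothetical, and the real rework may differ.

```java
import java.time.LocalDate;

class ElementTypeSketch {
    // Hypothetical helper: the Java type the array elements will actually have.
    // With a registered conversion the converted type wins; otherwise the
    // values stay in the raw type of the underlying schema.
    static Class<?> elementType(Class<?> rawType, Class<?> convertedType) {
        return convertedType != null ? convertedType : rawType;
    }

    // A primitive-backed array is only safe when the element type is one of
    // these primitive tokens or its wrapper; anything else needs Object storage.
    static boolean fitsPrimitiveArray(Class<?> type) {
        return type == Boolean.TYPE || type == Boolean.class
            || type == Integer.TYPE || type == Integer.class
            || type == Long.TYPE || type == Long.class
            || type == Float.TYPE || type == Float.class
            || type == Double.TYPE || type == Double.class;
    }

    public static void main(String[] args) {
        // int with a logical type but no conversion: still an int.
        System.out.println(fitsPrimitiveArray(elementType(Integer.TYPE, null)));
        // int whose conversion yields Integer: a primitive array still fits.
        System.out.println(fitsPrimitiveArray(elementType(Integer.TYPE, Integer.class)));
        // int whose conversion yields LocalDate: needs an Object array.
        System.out.println(fitsPrimitiveArray(elementType(Integer.TYPE, LocalDate.class)));
    }
}
```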

Author

@opwvhk that shows how little I know about Avro.
Will have a look and rework.

Author

@opwvhk reworked and added some tests

review feedback
remove schema check on returned value
Check convertors with logical types
Comment on lines 1532 to 1533
final var optimalValueType = PrimitivesArrays.optimalValueType(schema, logicalType,
conversion == null ? null : conversion.getConvertedType());
Contributor

Nice implementation, but IMHO there are two issues with it:

  1. the determination of the logical type and conversion belongs inside the method, as they're not used elsewhere
  2. it's used only here, but located in the class PrimitivesArrays

Let's move the method optimalValueType here.

It's a bit of a tricky choice, as determining an array element type is coupled with both the schema and the arrays, but IMHO it belongs more with the former: reading data into a type also happens for maps and record properties.

Author

I did go to and fro a bit. Do you think we should add an enum for the ArrayType, so that other types could be generated later (e.g. for short)?
It would be hard to open that up for custom types, though.
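
For illustration only, such an enum might look like the sketch below. `ArrayKindSketch`, `ArrayKind` and `forType` are hypothetical names, not part of Avro, and wrapper types are left out to keep it short.

```java
class ArrayKindSketch {
    // Hypothetical enum: one constant per backing-array flavour. Supporting
    // short later would mean adding one constant, but callers cannot register
    // kinds of their own, which is the limitation for custom types.
    enum ArrayKind {
        BOOLEAN(Boolean.TYPE),
        INT(Integer.TYPE),
        LONG(Long.TYPE),
        FLOAT(Float.TYPE),
        DOUBLE(Double.TYPE),
        OBJECT(Object.class);

        final Class<?> elementType;

        ArrayKind(Class<?> elementType) {
            this.elementType = elementType;
        }

        // Map an element type to a kind, falling back to OBJECT for anything unknown.
        static ArrayKind forType(Class<?> type) {
            for (ArrayKind kind : values()) {
                if (kind != OBJECT && kind.elementType == type) {
                    return kind;
                }
            }
            return OBJECT;
        }
    }

    public static void main(String[] args) {
        System.out.println(ArrayKind.forType(Integer.TYPE)); // INT
        System.out.println(ArrayKind.forType(Short.TYPE));   // OBJECT, no SHORT constant yet
    }
}
```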

Author

I did the rework as suggested

Member

@mkeskells FYI, Arrow Java is now on a dedicated repository: https://github.com/apache/arrow-java

review feedback