-
Notifications
You must be signed in to change notification settings - Fork 94
GH-891: Add ExtensionTypeWriterFactory to TransferPair #892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
67334a6 to
7eba2c1
Compare
|
Hello, @lidavidm! Could you take a look at this PR? Also, I don't have permissions to change the label |
|
@jhrotko I will take a look on this one as soon as the CI is green (it should be good very soon). |
vector/src/main/java/org/apache/arrow/vector/util/TransferPairWithExtendedType.java
Outdated
Show resolved
Hide resolved
vector/src/main/java/org/apache/arrow/vector/util/TransferPairWithExtendedType.java
Outdated
Show resolved
Hide resolved
vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
Outdated
Show resolved
Hide resolved
laurentgo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really familiar with arrow vectors to be honest, but I wonder why writers aren't discovered at the same time the extension is being registered as a type? wouldn't that make things simpler from an API/usability perspective?
|
This PR changes how we handle extension type writers in Arrow Java. Instead of using factories that get passed around everywhere, we now let the ProblemIn Arrow's type system, each The previous implementation (commits // Usage in ComplexCopier
writer.addExtensionTypeWriterFactory(extensionTypeWriterFactory);
writer.writeExtension(value);In this pattern, each extension type had a separate factory class (like Why the factory pattern wasn't working wellFor developers implementing extension types outside of arrow-java, the situation was even more painful. You had to create and manage two separate classes: one for the type itself ( The factory pattern had several issues that made it difficult to scale at this point. Specially if you wanted to use Extension Arrow-java types mixed with out of arrow-java extension types which is something that might happen more often in the future. The API also got cluttered with factory parameters. Methods like Finally, the factory pattern created tight coupling between the type definition, the writer implementation, the factory that connects them, and all the code that needs to pass factories around. This made it harder to change any one piece without affecting the others. The new approach: Let types provide their own writersI added one abstract method to public abstract class ExtensionType extends ArrowType {
// NEW METHOD
public abstract FieldWriter getNewFieldWriter(ValueVector vector);
// Other methods...
}public class UuidType extends ExtensionType {
@Override
public FieldWriter getNewFieldWriter(ValueVector vector) {
return new UuidWriterImpl((UuidVector) vector);
}
// Other methods...
}The new approach is simpler because you only need one class per extension type now, not two. The type knows how to create its own writer. This also means the API is cleaner since there are no more factory parameters cluttering everything. For example, This approach is also consistent with how // MinorType enum (existing pattern)
public enum MinorType {
INT(new Int(...)) {
@Override
public FieldWriter getNewFieldWriter(ValueVector vector) {
return new IntWriterImpl((IntVector) vector);
}
},
// ...
}
// ExtensionType (new pattern - same idea)
public class UuidType extends ExtensionType {
@Override
public FieldWriter getNewFieldWriter(ValueVector vector) {
return new UuidWriterImpl((UuidVector) vector);
}
}Finally, there's less coupling overall. Writers don't need to store or manage factories anymore, TransferPair implementations are simpler, and the type information just flows naturally through the ComplexCopier got simpler// OLD: Required factory parameter
case EXTENSIONTYPE:
if (extensionTypeWriterFactory == null) {
throw new IllegalArgumentException("Must provide ExtensionTypeWriterFactory");
}
if (reader.isSet()) {
Object value = reader.readObject();
if (value != null) {
writer.addExtensionTypeWriterFactory(extensionTypeWriterFactory);
writer.writeExtension(value);
}
}
...
// NEW: Type provides the writer
case EXTENSIONTYPE:
if (reader.isSet()) {
Object value = reader.readObject();
if (value != null) {
writer.writeExtension(value, reader.getField().getType());
}
}
... |
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a brief glance this approach seems more reasonable
| protected ArrowType lastExtensionType; | ||
|
|
||
| @Override | ||
| public void writeExtension(Object value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(design) should we deprecate this method? (since we now have writeExtension(ExtensionHolder)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if it should be deprecated, looking at other implementations they usually offer the writeX(X arg) ex.: writeInt, and write(XHolder holder)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe they do when there's no confusion about type/representation. But here we are relying on lastExtensionType to be set first via getWriter()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the issue with lastExtensionType state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is part of the 18.3.0 release, so removing it would be a breaking change. We could have a discussion if it okay or not. If it is, maybe we can be a bit more decisive on some other methods (like PromotableWriter#writeExtension(Object)) but otherwise, file need to be kept with a @Deprecated annotation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to move forward with this design it's going to be a breaking change because the factory pattern was completely replaced, not deprecated alongside the new pattern. This will require users to migrate. Fortunately, the migration will be easy: Extension types must implement getNewFieldWriter() method and Extension holders need to implement the type() method as well and remove all factory references. I can provide a better migration guide in the PR description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're doing a major bump anyways it would be a good chance to improve things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added migration steps in PR description
|
@jarohen does XTDB use extension types? |
|
in AArch64 macOS latest Java JDK 11 it's a BasicAuth test seems unrelated with the changes In AMD64 integration related with apache/arrow-go#571 and In AMD64 macOS 13 Java JDK 11 the error also seems unrelated |
@lidavidm it does, but we have our own mechanisms for that outside of arrow-java I'm afraid. We're mostly using arrow-java for the IPC and memory management these days - we needed too many bespoke access patterns of the vectors themselves (particularly DUV) and didn't feel it reasonable to expect you folks to bend over backwards just for us 😄 That said, XT's all open source, feel free to pinch what you like, and I'm more'n happy to talk more about it (maybe a different thread though), if there's anything we can contribute back 🙂 |
|
Thanks for the confirmation! Just wanted to evaluate how this might affect you if we went ahead, sounds like it wouldn't be a problem |
What's Changed
This PR simplifies extension type writer creation by moving from a factory-based pattern to a type-based pattern. Instead of passing
ExtensionTypeWriterFactoryinstances through multiple API layers, extension types now provide their own writers via a newgetNewFieldWriter()method onArrowType.ExtensionType.getNewFieldWriter(ValueVector)abstract method toArrowType.ExtensionTypeExtensionTypeWriterFactoryinterface and all implementationsComplexCopier,PromotableWriter, andTransferPairAPIsUnionWriterto support extension types (previously threwUnsupportedOperationException)UuidType,OpaqueType)The factory pattern didn't scale well. Each new extension type required creating a separate factory class and passing it through multiple API layers. This was especially painful for external developers who had to maintain two classes per extension type and manage factory parameters everywhere.
The new approach follows the same pattern as
MinorType, where each type knows how to create its own writer. This reduces boilerplate, simplifies the API, and makes it easier to implement custom extension types outside arrow-java.Breaking Changes
ExtensionTypeWriterFactoryhas been removedgetNewFieldWriter(ValueVector vector)methodtype()which returns theExtensionTypefor that HolderMigration Guide
getNewFieldWriter(ValueVector vector)methodtype()which returns theExtensionTypefor that HolderHow to use Extension Writers?
Before:
After:
Also
copyAsValuedoes not need to provide the factory anymore.Closes #891 .