@jhrotko jhrotko commented Oct 22, 2025

What's Changed

This PR simplifies extension type writer creation by moving from a factory-based pattern to a type-based pattern. Instead of passing ExtensionTypeWriterFactory instances through multiple API layers, extension types now provide their own writers via a new getNewFieldWriter() method on ArrowType.ExtensionType.

  • Added getNewFieldWriter(ValueVector) abstract method to ArrowType.ExtensionType
  • Removed ExtensionTypeWriterFactory interface and all implementations
  • Removed factory parameters from ComplexCopier, PromotableWriter, and TransferPair APIs
  • Updated UnionWriter to support extension types (previously threw UnsupportedOperationException)
  • Simplified extension type implementations (UuidType, OpaqueType)

The factory pattern didn't scale well. Each new extension type required creating a separate factory class and passing it through multiple API layers. This was especially painful for external developers who had to maintain two classes per extension type and manage factory parameters everywhere.

The new approach follows the same pattern as MinorType, where each type knows how to create its own writer. This reduces boilerplate, simplifies the API, and makes it easier to implement custom extension types outside arrow-java.

Breaking Changes

  • ExtensionTypeWriterFactory has been removed
  • Extension types must now implement getNewFieldWriter(ValueVector vector) method
  • ExtensionHolders must implement type() which returns the ExtensionType for that Holder
  • (Writers are obtained directly from the extension type, not from a factory)

Migration Guide

  • Extension types must now implement getNewFieldWriter(ValueVector vector) method
public class UuidType extends ExtensionType {
  ...

  @Override
  public FieldWriter getNewFieldWriter(ValueVector vector) {
    return new UuidWriterImpl((UuidVector) vector);
  }
  ...
}
  • ExtensionHolders must implement type() which returns the ExtensionType for that Holder
public class UuidHolder extends ExtensionHolder {
  ...

  @Override
  public ArrowType type() {
    return UuidType.INSTANCE;
  }
}
  • How to use Extension Writers?
    Before:

      writer.extension(UuidType.INSTANCE);
      writer.addExtensionTypeWriterFactory(extensionTypeWriterFactory);
      writer.writeExtension(value);

    After:

      writer.extension(UuidType.INSTANCE);
      writer.writeExtension(value);
  • copyAsValue no longer requires the factory to be passed in.

Closes #891.

@jhrotko jhrotko force-pushed the GH-891 branch 2 times, most recently from 67334a6 to 7eba2c1 Compare October 22, 2025 21:09
@jhrotko jhrotko marked this pull request as ready for review October 22, 2025 21:13
Author

jhrotko commented Oct 23, 2025

Hello, @lidavidm! Could you take a look at this PR? Also, I don't have permission to change the label.

@lidavidm lidavidm added the enhancement PRs that add or improve features. label Oct 23, 2025
@github-actions github-actions bot added this to the 18.4.0 milestone Oct 23, 2025
@jbonofre
Member

@jhrotko I will take a look at this one as soon as the CI is green (it should be good very soon).

@jhrotko jhrotko requested a review from laurentgo October 30, 2025 09:39
Contributor

@laurentgo laurentgo left a comment

I'm not really familiar with Arrow vectors, to be honest, but I wonder why writers aren't discovered at the same time the extension is registered as a type. Wouldn't that make things simpler from an API/usability perspective?

Author

jhrotko commented Nov 7, 2025

This PR changes how we handle extension type writers in Arrow Java. Instead of using factories that get passed around everywhere, we now let the ArrowType.ExtensionType itself provide the writer implementation. This makes the API simpler and easier to work with, especially if you're implementing custom extension types outside arrow-java.

Problem

In Arrow's type system, each MinorType (INT, FLOAT, VARCHAR, etc.) has its own writer implementation. Extension types are trickier though, because they all share the same MinorType.EXTENSIONTYPE, but each extension type (UUID, Opaque, custom types) needs its own writer implementation. We needed some way to figure out which writer to use for a given extension type.

The previous implementation (commits 34060eb4, 7a7e4edd, 7fe36d70, 8663ffc6) used an ExtensionTypeWriterFactory pattern:

// Usage in ComplexCopier
writer.addExtensionTypeWriterFactory(extensionTypeWriterFactory);
writer.writeExtension(value);

In this pattern, each extension type had a separate factory class (like UuidWriterFactory) that was passed around as a parameter for copy methods. The custom extension writers stored these factories and used them to create the appropriate writer.
The TransferPair implementations also needed to carry factories, which polluted unrelated ValueVector classes such as IntVector.

Why the factory pattern wasn't working well

The factory pattern had several issues that made it difficult to scale, especially when mixing arrow-java's built-in extension types with extension types defined outside arrow-java, which is likely to happen more often in the future.

For developers implementing extension types outside of arrow-java, the situation was even more painful. You had to create and manage two separate classes: one for the type itself (MyCustomType extends ExtensionType) and another for the factory (MyCustomWriterFactory implements ExtensionTypeWriterFactory).

The API also got cluttered with factory parameters. Methods like ComplexCopier.copy(reader, writer, extensionTypeWriterFactory), writer.addExtensionTypeWriterFactory(factory), and TransferPair.makeTransferPair(target, factory) all needed these extra parameters. This made the API harder to use and understand.

Finally, the factory pattern created tight coupling between the type definition, the writer implementation, the factory that connects them, and all the code that needs to pass factories around. This made it harder to change any one piece without affecting the others.
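
To make the coupling above concrete, here is a minimal, self-contained sketch of the factory-style wiring. Every name in it (FactoryPatternSketch, SketchWriter, SketchVector, SketchWriterFactory, newWriter) is a hypothetical stand-in invented for illustration; none of it is the arrow-java API, and the real signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins only; none of these are the arrow-java classes.
final class FactoryPatternSketch {
    interface SketchWriter {
        void write(Object value);
    }

    static final class SketchVector {
        final List<Object> values = new ArrayList<>();
    }

    // Class 1 of 2: the extension type itself.
    static final class SketchUuidType {
        static final SketchUuidType INSTANCE = new SketchUuidType();

        private SketchUuidType() {}
    }

    // Class 2 of 2: a separate factory that knows which writer matches the type.
    interface SketchWriterFactory {
        SketchWriter newWriter(SketchVector vector);
    }

    static final class SketchUuidWriterFactory implements SketchWriterFactory {
        @Override
        public SketchWriter newWriter(SketchVector vector) {
            return value -> vector.values.add(value);
        }
    }

    // Any layer that may meet an extension value must accept and forward the factory.
    static void copy(Object value, SketchVector target, SketchWriterFactory factory) {
        if (factory == null) {
            throw new IllegalArgumentException("Must provide a writer factory");
        }
        factory.newWriter(target).write(value);
    }

    public static void main(String[] args) {
        SketchVector target = new SketchVector();
        copy("some-uuid", target, new SketchUuidWriterFactory());
        System.out.println(target.values);
    }
}
```

In this model the type, the factory, and every caller of copy all have to agree with each other, which is the kind of coupling described above.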

The new approach: Let types provide their own writers

I added one abstract method to ArrowType.ExtensionType:

public abstract class ExtensionType extends ArrowType {
    // NEW METHOD
    public abstract FieldWriter getNewFieldWriter(ValueVector vector);

    // Other methods...
}

public class UuidType extends ExtensionType {
    @Override
    public FieldWriter getNewFieldWriter(ValueVector vector) {
        return new UuidWriterImpl((UuidVector) vector);
    }

    // Other methods...
}

The new approach is simpler because you only need one class per extension type now, not two. The type knows how to create its own writer. This also means the API is cleaner since there are no more factory parameters cluttering everything. For example, ComplexCopier.copy(reader, writer) and writer.writeExtension(value, type) are much more straightforward, and the type provides the writer internally through extensionType.getNewFieldWriter(vector).

This approach is also consistent with how MinorType already works. The existing pattern for MinorType has each enum constant override getNewFieldWriter() to return its specific writer implementation. Extension types now follow the same pattern:

// MinorType enum (existing pattern)
public enum MinorType {
    INT(new Int(...)) {
        @Override
        public FieldWriter getNewFieldWriter(ValueVector vector) {
            return new IntWriterImpl((IntVector) vector);
        }
    },
    // ...
}

// ExtensionType (new pattern - same idea)
public class UuidType extends ExtensionType {
    @Override
    public FieldWriter getNewFieldWriter(ValueVector vector) {
        return new UuidWriterImpl((UuidVector) vector);
    }
}

Finally, there's less coupling overall. Writers don't need to store or manage factories anymore, TransferPair implementations are simpler, and the type information just flows naturally through the ArrowType object.
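
For comparison, a rough model of the type-provided approach might look like the sketch below. It uses hypothetical stand-in classes (ModelWriter, ModelVector, ModelExtensionType, ModelUuidType, ModelPointType) rather than the real arrow-java types, and it only illustrates the dispatch idea: generic code asks the type object for a writer, so nothing else has to be passed along.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins only; these are not the arrow-java classes.
final class TypeProvidedWriterSketch {
    interface ModelWriter {
        void write(Object value);
    }

    static final class ModelVector {
        final List<String> values = new ArrayList<>();
    }

    // Plays the role of an extension type that exposes its own writer.
    abstract static class ModelExtensionType {
        abstract ModelWriter getNewFieldWriter(ModelVector vector);
    }

    // A "built-in" extension type.
    static final class ModelUuidType extends ModelExtensionType {
        static final ModelUuidType INSTANCE = new ModelUuidType();

        @Override
        ModelWriter getNewFieldWriter(ModelVector vector) {
            return value -> vector.values.add("uuid:" + value);
        }
    }

    // A hypothetical type defined outside the library; it plugs in the same way.
    static final class ModelPointType extends ModelExtensionType {
        static final ModelPointType INSTANCE = new ModelPointType();

        @Override
        ModelWriter getNewFieldWriter(ModelVector vector) {
            return value -> vector.values.add("point:" + value);
        }
    }

    // Generic code needs only the type object; no factory is threaded through.
    static void writeExtension(Object value, ModelExtensionType type, ModelVector target) {
        type.getNewFieldWriter(target).write(value);
    }

    public static void main(String[] args) {
        ModelVector target = new ModelVector();
        writeExtension("a1", ModelUuidType.INSTANCE, target);
        writeExtension("(1,2)", ModelPointType.INSTANCE, target);
        System.out.println(target.values);
    }
}
```

Mixing a library-provided type with an externally defined one needs no extra plumbing in this model, since each type carries its own writer logic.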

ComplexCopier got simpler

// OLD: Required factory parameter
case EXTENSIONTYPE:
    if (extensionTypeWriterFactory == null) {
        throw new IllegalArgumentException("Must provide ExtensionTypeWriterFactory");
    }
    if (reader.isSet()) {
        Object value = reader.readObject();
        if (value != null) {
            writer.addExtensionTypeWriterFactory(extensionTypeWriterFactory);
            writer.writeExtension(value);
        }
    }
...

// NEW: Type provides the writer
case EXTENSIONTYPE:
    if (reader.isSet()) {
        Object value = reader.readObject();
        if (value != null) {
            writer.writeExtension(value, reader.getField().getType());
        }
    }
...

Author

jhrotko commented Nov 7, 2025

@lidavidm @xxlaykxx could you also take a look?

Member

@lidavidm lidavidm left a comment

From a brief glance this approach seems more reasonable

protected ArrowType lastExtensionType;

@Override
public void writeExtension(Object value) {
Contributor

(design) Should we deprecate this method, since we now have writeExtension(ExtensionHolder)?

Author

I am not sure it should be deprecated. Other implementations usually offer both writeX(X arg), e.g. writeInt, and write(XHolder holder).

Contributor

I believe they do when there's no confusion about type/representation. But here we are relying on lastExtensionType to be set first via getWriter()
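
The ordering concern can be illustrated with a small model. This is only a sketch under stated assumptions: StatefulWriterSketch, SketchHolder, and the choice to throw IllegalStateException when no type has been set are all hypothetical, and the thread does not say what the real writer does in that case.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; names and behavior are assumptions, not arrow-java code.
final class StatefulWriterSketch {
    // Hypothetical holder that carries its own type alongside the value.
    static final class SketchHolder {
        final String typeName;
        final Object value;

        SketchHolder(String typeName, Object value) {
            this.typeName = typeName;
            this.value = value;
        }
    }

    private String lastExtensionType; // set as a side effect of extension(...)
    final List<String> written = new ArrayList<>();

    StatefulWriterSketch extension(String typeName) {
        lastExtensionType = typeName;
        return this;
    }

    // Depends on call order: only meaningful after extension(...) has run.
    void writeExtension(Object value) {
        if (lastExtensionType == null) {
            // Assumed failure mode for this sketch.
            throw new IllegalStateException("extension type was never set");
        }
        written.add(lastExtensionType + ":" + value);
    }

    // Self-describing alternative: the type travels with the value.
    void writeExtension(SketchHolder holder) {
        written.add(holder.typeName + ":" + holder.value);
    }

    public static void main(String[] args) {
        StatefulWriterSketch writer = new StatefulWriterSketch();
        writer.extension("uuid").writeExtension("v1");
        writer.writeExtension(new SketchHolder("point", "v2"));
        System.out.println(writer.written);
    }
}
```

In this model the Object overload is correct only when the caller remembers the earlier extension(...) call, while the holder overload has no such precondition.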

Author

@jhrotko jhrotko Nov 11, 2025

What is the issue with the lastExtensionType state?

Contributor

This file is part of the 18.3.0 release, so removing it would be a breaking change. We could discuss whether that is okay. If it is, maybe we can be a bit more decisive about some other methods (like PromotableWriter#writeExtension(Object)); otherwise, the file needs to be kept with a @Deprecated annotation.

Author

@jhrotko jhrotko Nov 11, 2025

If we decide to move forward with this design, it's going to be a breaking change, because the factory pattern was completely replaced rather than deprecated alongside the new pattern. This will require users to migrate. Fortunately, the migration is easy: extension types must implement the getNewFieldWriter() method, extension holders must implement the type() method, and all factory references must be removed. I can provide a better migration guide in the PR description.

Member

If we're doing a major bump anyways it would be a good chance to improve things.

Author

@jhrotko jhrotko Nov 11, 2025

Added migration steps in PR description

@lidavidm
Member

@jarohen does XTDB use extension types?

Author

jhrotko commented Nov 11, 2025

On AArch64 macOS latest (JDK 11), the failure is in a BasicAuth test and seems unrelated to these changes:

Error:  org.apache.arrow.flight.auth.TestBasicAuth -- Time elapsed: 1.149 s <<< ERROR!
java.lang.IllegalStateException: 
Memory was leaked by query. Memory leaked: (65536)
Allocator(ROOT) 0/65536/131072/9223372036854775807 (res/actual/peak/limit)

	at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:504)
	at org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:27)
	at org.apache.arrow.util.AutoCloseables.close(AutoCloseables.java:97)
	at org.apache.arrow.util.AutoCloseables.close(AutoCloseables.java:75)
	at org.apache.arrow.flight.auth.TestBasicAuth.shutdown(TestBasicAuth.java:181)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
	at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)

On AMD64 integration, the failure is related to apache/arrow-go#571.

On AMD64 macOS 13 (JDK 11), the error also seems unrelated:

Error:  Could not acquire lock(s)
Error:  java.lang.IllegalStateException: Could not acquire lock(s)
Error:  
Error:  To see the full stack trace of the errors, re-run Maven with the -e switch.
Error:  Re-run Maven using the -X switch to enable full debug logging.
Error: Process completed with exit code 1.

Contributor

jarohen commented Nov 13, 2025

@jarohen does XTDB use extension types?

@lidavidm it does, but we have our own mechanisms for that outside of arrow-java I'm afraid. We're mostly using arrow-java for the IPC and memory management these days - we needed too many bespoke access patterns of the vectors themselves (particularly DUV) and didn't feel it reasonable to expect you folks to bend over backwards just for us 😄

That said, XT's all open source, feel free to pinch what you like, and I'm more'n happy to talk more about it (maybe a different thread though), if there's anything we can contribute back 🙂

@lidavidm
Member

Thanks for the confirmation! Just wanted to evaluate how this might affect you if we went ahead, sounds like it wouldn't be a problem

Labels

breaking-change enhancement PRs that add or improve features.

Development

Successfully merging this pull request may close these issues.

Add ExtensionTypeWriterFactory to TransferPair

5 participants