Skip to content

Experiment with Extension as a DataType #7398

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

paleolimbot
Copy link
Member

Which issue does this PR close?

Experiments in pursuit of #7063 and apache/datafusion#12644 .

Rationale for this change

Customizing behaviour (e.g., sorting, signatures for user defined functions, import/export from/to formats like Parquet, Arrow IPC, and FFI) is increasingly requested. arrow-rs is also increasingly the implementation of choice for compute frameworks and many APIs have been built around the DataType and ArrayRef (e.g., GeoArrow, parquet, many parts of DataFusion). To support additional types, those frameworks have to either invent a new array type and rewrite substantial wrappers (e.g., GeoArrow), or use something other than a DataType (e.g., a Field or MyOwnDataType) and forego having that type automatically pass through APIs from other arrow-rs-based libraries (or arrow itself).

Incorporating an ExtensionType as a first-class data type provides a potentially less disruptive route for libraries like parquet and DataFusion to implement new Parquet types and other types other databases implement like JSON and UUID. I also have an experiment going for swapping out a Field for a DataType in DataFusion ( apache/datafusion#15036 ).

What changes are included in this PR?

A new DataType::Extension() enum member was added. I had hoped to have this be DataType::Extension(Arc<dyn ExtensionType>) but the current design of the ExtensionType is not dyn-compatible. The DynExtensionType trait is really a placeholder to collect the requirements of the rest of arrow-rs for this experiment.

This is currently just an experiment/minimum change that actually builds! I'm happy to continue the experiment if there is interest.

Are there any user-facing changes?

Yes, a new data type enum member at least (which is a breaking change).

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Apr 9, 2025
@paleolimbot paleolimbot changed the title Type extension maybe Experiment with Extension as a DataType Apr 9, 2025
@@ -276,6 +276,7 @@ fn build_extend(array: &ArrayData) -> Extend {
UnionMode::Dense => union::build_extend_dense(array),
},
DataType::RunEndEncoded(_, _) => todo!(),
DataType::Extension(_) => unimplemented!("Extension not implemented"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This highlights the challenge with this approach, every kernel/function must now be updated to accommodate this new extension type. Not only does this stand a very high chance of regressing functionality, but also is IMO not in the spirit of extension types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the tradeoff is taking on the burden of extensibility (so that other projects don't have to implement it separately, for example, by redefining wrapper DataTypes or Arrays). This extra burden hasn't caused issues in Arrow C++/pyarrow that I'm aware of but absolutely fair that it might here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants