Skip to content

Conversation

mbrobbel
Copy link
Member

Which issue does this PR close?

Rationale for this change

As discussed in the linked issue we want to remove the experimental warning from the extension module.

When adding extension type information via the Field methods, ExtensionType::supports_datatype is used to validate the Field datatype is compatible with the given extension type. When users change the Field datatype, it's not clear if this is still true, so this PR proposes to remove the extension type information when changing datatype information.

What changes are included in this PR?

Adds two Field methods:

  • Field::clear_extension_type
  • Field::without_extension_type

These methods are used when changing data types via Field::set_data_type and Field::with_data_type.

Are these changes tested?

Yes, CI.

Are there any user-facing changes?

Yes, Field::set_data_type and Field::with_data_type now clear extension type information from the Field metadata.

@mbrobbel mbrobbel added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Sep 15, 2025
@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 15, 2025
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks @mbrobbel

@alamb
Copy link
Contributor

alamb commented Sep 15, 2025

THANK YOU!

@mbrobbel mbrobbel added this to the 57.0.0 milestone Sep 15, 2025
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

#[inline]
pub fn set_data_type(&mut self, data_type: DataType) {
self.data_type = data_type;
self.clear_extension_type();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was initially worried that this would obliterate metadata that previously had been propagated, but I couldn't find any existing usage of this function (I don't even think it's tested in arrow-rs) and so I think it's probably not a problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If nobody actually uses set_data_type, can we deprecate and remove it?
It's kind of a weird API surface, especially with the new extension type semantics

/// ```
pub fn with_data_type(mut self, data_type: DataType) -> Self {
self.set_data_type(data_type);
self.without_extension_type()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I missed this.

This will probably obliterate metadata in some places. There's a bunch of code in datafusion that looks like this:

            let new_fields = fields
                .into_iter()
                .zip(new_data_types)
                .map(|(f, d)| f.as_ref().clone().with_data_type(d))
                .map(Arc::new)
                .collect::<Vec<FieldRef>>();

...where d is often unchanged but the outer field may have contained metadata. Sometimes the data type change is compatible with an extension type (e.g., reconciling String and StringViews), sometimes it is not (e.g., if there is any code that reconciles FixedSizeBinary and Binary inputs, which there may not be).

Using a Field as a data type is a constant whack-a-mole of these types of issues but I think this might be safer propagating the extension metadata until downstream projects have a chance to move away from that.

Copy link
Contributor

@scovich scovich Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems really dangerous... it will silently strip away the extension type information, which can propagate into e.g. the parquet file the column eventually gets written to, which could be very hard to fix later on when downstream readers start noticing the extension type info is unexpectedly missing.

Seems better to make this fallible (with_data_type becomes try_with_data_type), and blow up if the new data type is incompatible with the currently-installed extension type?

We could potentially still keep the original (infallible) version and panic if the new type is invalid. Doing that means people upgrading to arrow-57 won't necessarily realize there's a potential problem until their code starts crashing, which I don't love, but I suspect some people would rather deal with a potential panic than have to rewrite all their with_data_type calls?

Alternatively -- maybe we leave the panicky infallible version in place, but deprecated with an appropriate warning? But people with automated builds won't necessarily see the warning, especially if they already have lots of warnings in their code.

If somebody actually wanted to clear the extension type, it seems much better to require them to clearly state their intent by e.g. field.without_extension_type().with_data_type(...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is not a good solution. It would be nice to prevent invalid state via the API of Field, however, with set_metadata and metadata_mut it's already possible to set datatype incompatible extension type information, so either we accept it (and replace the clear calls added in this PR with a warning in the docs) or we reconsider the mutating methods of the Field API.

Seems better to make this fallible (with_data_type becomes try_with_data_type), and blow up if the new data type is incompatible with the currently-installed extension type?

The problem is you can't check data type compatibility without the actual extension type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think clear documentation warnings are best because (as @mbrobbel says) you don't know what combination is valid without the actual extension type.

Perhaps we can also add some examples of "how to change datatype and still validate extension types" or something to make it easier for downstream users to verify the expected extension types are still valid after modification

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that is indeed awkward that we don't have any easy way to map from a random extension name string to the actual type :(

@scovich
Copy link
Contributor

scovich commented Sep 18, 2025

Adds two Field methods:

* `Field::clear_extension_type`

* `Field::without_extension_type`

These methods are used when changing data types via Field::set_data_type and Field::with_data_type.

I don't think we need both? The current implementation of Field::with_data_type just delegates to Field::set_data_type?

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a better approach than silently and blindly clearing the extension type info? See comment.

Comment on lines 387 to +390
/// Set [`DataType`] of the [`Field`] and returns self.
///
/// This clears extension type information from the returned field by
/// calling [`Field::without_extension_type`].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Set [`DataType`] of the [`Field`] and returns self.
///
/// This clears extension type information from the returned field by
/// calling [`Field::without_extension_type`].
/// Returns a new field with the requested [`DataType`].
///
/// Extension type information (if any) is not propagated.

Comment on lines +369 to +372
/// Set [`DataType`] of the [`Field`].
///
/// This clears extension type information from this field by calling
/// [`Field::clear_extension_type`].
Copy link
Contributor

@scovich scovich Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Set [`DataType`] of the [`Field`].
///
/// This clears extension type information from this field by calling
/// [`Field::clear_extension_type`].
/// Set [`DataType`] of the [`Field`], clearing any extension type information in the process.

#[inline]
pub fn set_data_type(&mut self, data_type: DataType) {
self.data_type = data_type;
self.clear_extension_type();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If nobody actually uses set_data_type, can we deprecate and remove it?
It's kind of a weird API surface, especially with the new extension type semantics

/// ```
pub fn with_data_type(mut self, data_type: DataType) -> Self {
self.set_data_type(data_type);
self.without_extension_type()
Copy link
Contributor

@scovich scovich Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems really dangerous... it will silently strip away the extension type information, which can propagate into e.g. the parquet file the column eventually gets written to, which could be very hard to fix later on when downstream readers start noticing the extension type info is unexpectedly missing.

Seems better to make this fallible (with_data_type becomes try_with_data_type), and blow up if the new data type is incompatible with the currently-installed extension type?

We could potentially still keep the original (infallible) version and panic if the new type is invalid. Doing that means people upgrading to arrow-57 won't necessarily realize there's a potential problem until their code starts crashing, which I don't love, but I suspect some people would rather deal with a potential panic than have to rewrite all their with_data_type calls?

Alternatively -- maybe we leave the panicky infallible version in place, but deprecated with an appropriate warning? But people with automated builds won't necessarily see the warning, especially if they already have lots of warnings in their code.

If somebody actually wanted to clear the extension type, it seems much better to require them to clearly state their intent by e.g. field.without_extension_type().with_data_type(...)


//! Extension types.
//!
//! <div class="warning">This module is experimental. There might be breaking changes between minor releases.</div>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate next-major-release the PR has API changes and it waiting on the next major version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make extension types fully supported (not experimental)
5 participants