Conversation


@alamb alamb commented Sep 16, 2025

Which issue does this PR close?

Rationale for this change

We need a way to read and write Variant encoded data to/from parquet, and the current
VariantArray implementation doesn't work (it panics when writing to parquet)

What changes are included in this PR?

  1. remove the Array impl for VariantArray, which forces explicit conversions back and forth when writing
  2. Recognize the parquet Variant logical type and add the correct Arrow extension type metadata
  3. Write the parquet Variant logical type when the correct Arrow extension type metadata is present

As a follow-on PR I plan to give the same basic treatment to ShreddedVariantStateArray
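
To illustrate item 2, here is a rough sketch of the kind of annotated Arrow field involved (the extension name string and child nullability shown here are assumptions; the crate exposes the name as parquet_variant_compute::VariantType::NAME):

use std::collections::HashMap;
use arrow_schema::{DataType, Field, Fields};

/// Sketch only: an Arrow field for an unshredded Variant column carrying the
/// extension type metadata. The "arrow.parquet.variant" name is assumed here.
fn variant_field(name: &str) -> Field {
    let variant_struct = DataType::Struct(Fields::from(vec![
        Field::new("metadata", DataType::BinaryView, false),
        Field::new("value", DataType::BinaryView, true),
    ]));
    Field::new(name, variant_struct, true).with_metadata(HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.parquet.variant".to_string(),
    )]))
}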

Are these changes tested?

Yes, by existing tests and new (doc) tests

Are there any user-facing changes?

Yes, but this is not yet stable / released, so these changes have no impact on the releasability of this code

  1. VariantArray no longer implements Array

@github-actions github-actions bot added the parquet-variant (parquet-variant* crates) and parquet (Changes to the parquet crate) labels Sep 16, 2025
@alamb alamb changed the title from "Support reading/writing VariantArray to parquet, VariantArray does not implement Array" to "Support reading/writing VariantArray to parquet with correct LogicalType" on Sep 17, 2025
@scovich scovich left a comment

Nice. Initial pass didn't turn up anything scary?

Comment on lines +56 to +62
// Note don't check for metadata/value fields here because they may be
// absent in shredded variants
if matches!(data_type, DataType::Struct(_)) {
    Ok(())
Contributor

One thing I thought of while working on shredding, variant_get, etc -- what if we supported a metadata that indicated the desired handling of shredded columns? So for example (naming TBD), a caller could annotate a binary variant type with "keep-shredding" to indicate that reads and variant_get calls should preserve the existing shredding structure of the underlying data, or "unshred" to indicate that such calls should return binary variant even if the underlying data are shredded. The output data would not contain any annotation, because it is what it is.

Such an approach would solve the problem of how to request partially shredded results using variant_get, for example. Today, we can get that behavior by passing as_type=None, but that doesn't work if I want a nested type with some fields shredded and others left as-is.
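
To sketch the idea in code (all names here are placeholders for the proposal; nothing like this exists in the crate today):

/// Hypothetical annotation describing how reads and variant_get should treat
/// shredded input when the requested type is variant (naming TBD).
enum ShreddingRequest {
    /// Preserve whatever shredding structure the underlying data already has.
    KeepShredding,
    /// Return binary variant even if the underlying data is shredded.
    Unshred,
}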

@alamb alamb Sep 17, 2025

I don't fully understand the problem, and thus probably don't fully understand your proposal

I would expect that variant_get would always preserve the existing shredding structure, and if the user wants to unshred the result (e.g. ensure the output of variant_get is an unshredded variant), I would expect any such conversion to be specified as part of the variant_get call rather than as an annotation on the VariantArray itself

Maybe we could add a variant_unshred kernel, potentially with a path argument to only partially unshred an object 🤔 I am a little unclear on the use case for that type of operation though

@scovich scovich Sep 17, 2025

There are a few cases:

  • Pass as_type=None -- this would preserve existing shredding structure, if any
  • Pass as_type=Some(... VariantType...) -- what schema should the underlying struct have, to:
    • Force shredding as a specific schema
    • Force unshredding to binary
    • Return whatever was there (equivalent to passing None)
  • Pass as_type=Some(StructType(... VariantType ...)) -- this requests to shred the variant value as a struct, but that the field(s) inside that struct are variant-typed (not shredded). The same three questions apply, but now as_type=None is not a fallback
    • We had talked previously about providing a function that could return the shredding schema for a given path, so the user can stub in the variant-typed fields however they want. But that's pretty fiddly.
    • The alternative (which I proposed above) was to allow annotating the variant type to indicate that variant_get should preserve the underlying shredding structure, without having to actually specify it. Without that annotation, the variant's schema is taken literally and may cause shredding, reshredding, or unshredding depending on the underlying data's structure.
    • Much cleaner, and actually simpler for the implementation as well. Once the field's path is exhausted, just return whatever was there (similar to how as_type=None is implemented today); if we passed an explicit shredding schema instead, then the implementation would have to examine every node of the schema, all the way to leaf level, in case the provided shredding structure doesn't actually match the underlying.

Contributor Author

Once the field's path is exhausted, just return whatever was there (similar to how as_type=None is implemented today); if we passed an explicit shredding schema instead, then the implementation would have to examine every node of the schema, all the way to leaf level, in case the provided shredding structure doesn't actually match the underlying.

This is sort of what I expected to happen today.

Maybe we could use the cast_to_variant kernel for this? For example, if I had a VariantArray with some arbitrary shredding, I could do:

let array: VariantArray = ...; // input somehow
// get the "foo" field (may be shredded).
let foo = variant_get(array, Path::from("foo"))?;
// ensure foo is not shredded
let unshredded_foo = cast_to_variant(foo)?;

🤔

Contributor

The scenario I'm thinking of is a more forgiving replacement for normal json parsing, where the requested schema has a mix of strongly-typed and variant fields. So parse JSON to variant, and then use variant_get to extract the actual schema of interest. Fields that are known to have inconsistent (or legally flexible) behavior would be kept as variant, with strict typing enforcement applied only to the strongly typed fields (similar to today's JSON parsing with a schema). But in that case, we have a (potentially deeply) nested struct with multiple variant leaf fields, and fetching individual fields one at a time would be pretty annoying.

If the caller wants a specific schema -- whether binary variant or a specific shredding -- it's easy enough to pass that. The difficulty comes if the caller just wants to get back whatever flavor of variant is already there (without shredding or unshredding it first).

I don't see how cast_to_variant would help in that case?

Contributor Author

I don't see how cast_to_variant would help in that case?

I don't think it would (I misunderstood what you were getting at)

But in that case, we have a (potentially deeply) nested struct with multiple variant leaf fields, and fetching individual fields one at a time would be pretty annoying.
The difficulty comes if the caller just wants to get back whatever flavor of variant is already there (without shredding or unshredding it first).

I am confused -- if we are parsing JSON, then it will always be variant (not shredded), so I am not sure how there would be different flavors of variant. Maybe this would be a good topic for a call, or for some examples

Also I wonder if maybe we should open a new ticket as this conversation is likely to get pretty lost

@alamb alamb commented Sep 17, 2025

Nice. Initial pass didn't turn up anything scary?

Not really -- it seems to be going fine. The biggest issue I hit is that currently the parquet reader will read Binary columns as Binary rather than BinaryView, which is somewhat awkward.

In this PR I still need to implement writing logical types too, and then clean up the code. I hope to have time to finish it later today or tomorrow

@scovich scovich commented Sep 17, 2025

the parquet reader will read Binary columns as Binary rather than BinaryView, which is somewhat awkward.

Should we consider just adding support for Binary columns?

@alamb alamb commented Sep 17, 2025

the parquet reader will read Binary columns as Binary rather than BinaryView, which is somewhat awkward.

Should we consider just adding support for Binary columns?

I think we will need to do so eventually, given the arrow spec explicitly says they are supported

However, I think we can add that support in parallel (I would like to avoid blocking logical type support on Binary support)
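
For reference, a sketch of what accepting both layouts might look like (a hypothetical helper, not code from this PR):

use arrow_schema::DataType;

/// Hypothetical check that accepts any binary-like column for the metadata/value
/// fields instead of requiring BinaryView.
fn is_binary_like(data_type: &DataType) -> bool {
    matches!(
        data_type,
        DataType::Binary | DataType::LargeBinary | DataType::BinaryView
    )
}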

@alamb alamb force-pushed the alamb/variant_array_roundtrip branch from 57c75b3 to 7f7c2ee on September 17, 2025 21:41
@alamb alamb commented Sep 17, 2025

I need to do a little more cleanup tomorrow morning, then this should be ready to go

@alamb alamb changed the title from "Support reading/writing VariantArray to parquet with correct LogicalType" to "Support reading/writing VariantArray to parquet with Variant LogicalType" on Sep 18, 2025
@alamb alamb commented Sep 18, 2025

the parquet reader will read Binary columns as Binary rather than BinaryView, which is somewhat awkward.

Should we consider just adding support for Binary columns?

I think we will need to do so eventually, given the arrow spec explicitly says they are supported

However, I think we can add that support in parallel (I would like to avoid blocking logical type support on Binary support)

Tracking with #8387

@alamb alamb force-pushed the alamb/variant_array_roundtrip branch 3 times, most recently from 170c67a to db66c71, on September 18, 2025 20:55
@alamb alamb commented Sep 18, 2025

Update: I think this PR is now in pretty good shape (it adds the logical type annotations and has examples of reading/writing them)

However, it is pretty large to request a review on. My plan is to break out the "remove Array impl" part into a separate PR for easier review

@scovich scovich left a comment

I can actually see the end-to-end story now, nice!

Some(LogicalType::Variant) => {
    arrow_field.with_extension_type(parquet_variant_compute::VariantType)
}
// TODO add other LogicalTypes here
Contributor

Out of curiosity, how do the other parquet logical types map to arrow types? Seems like JSON, BSON, UUID, etc will face the same problem as Variant?

Contributor Author

They are interleaved in the schema conversion code with cfg like this:

#[cfg(feature = "arrow_canonical_extension_types")]
if let Some(logical_type) = basic_info.logical_type() {
    match logical_type {
        LogicalType::Uuid => ret.try_with_extension_type(Uuid)?,
        LogicalType::Json => ret.try_with_extension_type(Json::default())?,
        _ => {}
    }
}

I plan to move the code handling the other types into the extension.rs module as a follow-on PR

Comment on lines +50 to +56
#[cfg(feature = "variant_experimental")]
pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
    use parquet_variant_compute::VariantType;
    if field.extension_type_name()? == VariantType::NAME {
        return Some(LogicalType::Variant);
    };
    None
Contributor

It seems a bit odd that this API call's existence would be gated by the variant feature?

Suggested change:

-   #[cfg(feature = "variant_experimental")]
-   pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
-       use parquet_variant_compute::VariantType;
-       if field.extension_type_name()? == VariantType::NAME {
-           return Some(LogicalType::Variant);
-       };
-       None
+   pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
+       let type_name = field.extension_type_name()?;
+       #[cfg(feature = "variant_experimental")]
+       {
+           use parquet_variant_compute::VariantType;
+           if type_name == VariantType::NAME {
+               return Some(LogicalType::Variant);
+           }
+       }
+       None

Contributor

Ah, there's a cfg-not version below.

Contributor

But shouldn't we follow the example in other parts of the code, and try_extension_type rather than merely checking the name? Otherwise, somebody could slap the variant annotation on some completely irrelevant struct, with no validation to catch the problem.

Contributor Author

My rationale here (which I will add as a comment) was that creating an ArrowError is expensive (it allocates a String), so doing try_extension_type().ok() would add an allocation to all code paths, even those that don't have the Variant type.

I will try to make this clearer, and maybe add some more docs to the extension type code
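
For context, roughly what the two options look like (the literal extension name is an assumption; the crate exposes it as VariantType::NAME):

use arrow_schema::Field;

// Cheap check: compares only the extension name, so no ArrowError (and no String)
// is allocated for non-variant fields.
fn is_variant_by_name(field: &Field) -> bool {
    field.extension_type_name() == Some("arrow.parquet.variant")
}

// Validating alternative: try_extension_type constructs an ArrowError (allocating
// a String) for every field that is not a Variant:
//
//     field.try_extension_type::<parquet_variant_compute::VariantType>().is_ok()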

Contributor

That sounds like a flaw in the API. Should it return Option<Result<E>> to cover the "not an extension type at all" case? Either way, people have to manually check something.

BTW, the JSON extension type handling for string columns is doing the unconditional try_extension_type, so maybe we should fix that one as well? (either by the API change or by adding a specific check for existence of metadata before making the try_extension_type call)
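
A sketch of the suggested shape (not an existing method; the name and exact bounds are placeholders):

use arrow_schema::{extension::ExtensionType, ArrowError, Field};

/// Hypothetical variant of try_extension_type: None means the field has no
/// extension metadata at all; Some(Err(..)) means metadata is present but is
/// not a valid E.
fn try_extension_type_opt<E: ExtensionType>(field: &Field) -> Option<Result<E, ArrowError>> {
    field.extension_type_name()?;
    Some(field.try_extension_type::<E>())
}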

//! let file = std::fs::File::open(file_path())?;
//! let mut reader = ArrowReaderBuilder::try_new(file)?.build()?;
//!
//! // You can find the Variant using the VariantType extension type
Contributor

"find" ?

@alamb alamb Sep 19, 2025

clarified in 0bea137

Comment on lines +161 to +164
/// Ensure a file with Variant LogicalType, written by another writer in
/// parquet-testing, can be read as a VariantArray
#[test]
fn read_logical_type() {
Contributor

Isn't this the ~same as the doctest above? Do we really need both?

Contributor Author

Yes, you are right -- it is the same. I wrote this test before the doctest. I'll remove this one

Contributor Author

I double checked, and this test also does some more stuff like testing sizes. Do you think it is worth removing?

@scovich scovich Sep 19, 2025

Or, just move the additional assertions to the doc test, hidden behind # if users don't care to see them?
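
For reference, a minimal example of hidden doctest lines (the values are just placeholders):

/// ```
/// let values = vec![1, 2, 3];
/// let sum: i32 = values.iter().sum();
/// # // `#`-prefixed lines run as part of the doctest but are hidden when rendered
/// # assert_eq!(sum, 6);
/// ```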

Contributor Author

That is a great idea

})?;

let shredding_state =
    ShreddedVariantFieldArray::shredding_state_from_struct_array(struct_array)?;
Contributor

Why not just ShreddingState::from, out of curiosity?

The operation is actually infallible, it just has a fallible signature at the moment.
And AFAIK there's no difference in how we create the shredding state for top-level vs. object field.

Contributor Author

This is a great idea. I did it in #8392 (commit 4b1daab)

@alamb alamb force-pushed the alamb/variant_array_roundtrip branch 2 times, most recently from 385331d to 9f3c2bb, on September 19, 2025 17:18
@alamb alamb force-pushed the alamb/variant_array_roundtrip branch from 9f3c2bb to ca74726 on September 19, 2025 17:25
@alamb alamb changed the title from "Support reading/writing VariantArray to parquet with Variant LogicalType" to "Path Finding: Support reading/writing VariantArray to parquet with Variant LogicalType" on Sep 19, 2025
@alamb alamb commented Sep 19, 2025

I broke out the first chunk of work into its own PR here:

Once that is merged I'll make a PR with the remaining work

alamb added a commit that referenced this pull request on Sep 20, 2025:

This is a random cleanup I encountered while working on #8365

Successfully merging this pull request may close these issues.

  • [Variant] Support reading/writing Parquet Variant LogicalType
  • Support mapping canonical extension types to Parquet LogicalTypes