Path Finding: Support reading/writing `VariantArray` to parquet with `Variant` LogicalType #8365
Conversation
Related: `VariantArray` to parquet; `VariantArray` does not implement `Array`; `VariantArray` to parquet with correct LogicalType
Nice. Initial pass didn't turn up anything scary?
```rust
// Note don't check for metadata/value fields here because they may be
// absent in shredded variants
if matches!(data_type, DataType::Struct(_)) {
    Ok(())
```
One thing I thought of while working on shredding, variant_get, etc -- what if we supported a metadata that indicated the desired handling of shredded columns? So for example (naming TBD), a caller could annotate a binary variant type with "keep-shredding" to indicate that reads and variant_get calls should preserve the existing shredding structure of the underlying data, or "unshred" to indicate that such calls should return binary variant even if the underlying data are shredded. The output data would not contain any annotation, because it is what it is.
Such an approach would solve the problem of how to request partially shredded results using `variant_get`, for example. Today, we can get that behavior by passing `as_type=None`, but that doesn't work if I want a nested type with some fields shredded and others left as-is.
I don't fully understand the problem, and thus probably don't fully understand your proposal.

I would expect that `variant_get` would always preserve the existing shredding structure; if the user wants to unshred the result (e.g. ensure the output of `variant_get` is an unshredded variant), I would specify that conversion as part of `variant_get` rather than via some annotation on the `VariantArray` itself.

Maybe we could add a `variant_unshred` kernel, potentially with a path argument to only partially unshred an object 🤔 I am a little unclear on the use case for that type of operation though
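To make the idea above concrete, here is a toy sketch of what such a kernel's contract might look like. Every type and encoding detail here is invented for illustration; a real kernel would operate on `VariantArray` and emit spec-compliant Variant binary encoding.

```rust
// Toy model of a hypothetical `variant_unshred` kernel: a cell is either
// already binary-encoded variant, or "shredded" into a typed value, and
// unshredding rewrites shredded cells back to the binary form. All names
// and the byte encoding are stand-ins, not the arrow-rs API.
#[derive(Debug, Clone, PartialEq)]
enum VariantCell {
    Binary(Vec<u8>),
    ShreddedI64(i64),
}

fn variant_unshred(cell: &VariantCell) -> VariantCell {
    match cell {
        // Already binary: pass through unchanged
        VariantCell::Binary(bytes) => VariantCell::Binary(bytes.clone()),
        // A real kernel would produce proper Variant encoding here;
        // little-endian bytes are just a placeholder
        VariantCell::ShreddedI64(v) => VariantCell::Binary(v.to_le_bytes().to_vec()),
    }
}

fn main() {
    let shredded = VariantCell::ShreddedI64(42);
    assert!(matches!(variant_unshred(&shredded), VariantCell::Binary(_)));
    println!("ok");
}
```

A path argument would then restrict this rewrite to the addressed subtree instead of the whole column.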
There are a few cases:
- Pass `as_type=None` -- this would preserve existing shredding structure, if any
- Pass `as_type=Some(... VariantType...)` -- what schema should the underlying struct have, to:
  - Force shredding as a specific schema
  - Force unshredding to binary
  - Return whatever was there (equivalent to passing None)
- Pass `as_type=Some(StructType(... VariantType ...))` -- this requests to shred the variant value as a struct, but the field(s) inside that struct are variant-typed (not shredded). The same three questions apply, but now `as_type=None` is not a fallback
- We had talked previously about providing a function that could return the shredding schema for a given path, so the user can stub in the variant-typed fields however they want. But that's pretty fiddly.
- The alternative (which I proposed above) was to allow annotating the variant type to indicate that `variant_get` should preserve the underlying shredding structure, without having to actually specify it. Without that annotation, the variant's schema is taken literally and may cause shredding, reshredding, or unshredding depending on the underlying data's structure.
- Much cleaner, and actually simpler for the implementation as well. Once the field's path is exhausted, just return whatever was there (similar to how `as_type=None` is implemented today); if we passed an explicit shredding schema instead, then the implementation would have to examine every node of the schema, all the way to leaf level, in case the provided shredding structure doesn't actually match the underlying.
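A hedged sketch of what those behaviors might look like as an explicit request type (every name here is invented for illustration; this is not the arrow-rs / parquet_variant_compute API):

```rust
// Hypothetical sketch only: `ShreddingRequest` and its variants make the
// three behaviors discussed above explicit instead of overloading
// `as_type=None`. None of these names exist in arrow-rs.
#[derive(Debug, Clone, PartialEq)]
enum ShreddingRequest {
    /// Force shredding to a specific caller-provided schema
    /// (the schema itself is elided in this sketch)
    ShredAs,
    /// Force unshredding to binary variant
    Unshred,
    /// Return whatever shredding structure is already there,
    /// without having to spell it out
    PreserveExisting,
}

/// Describe what a `variant_get`-style call would do for each request.
fn describe(req: &ShreddingRequest) -> &'static str {
    match req {
        ShreddingRequest::ShredAs => "reshred to the requested schema",
        ShreddingRequest::Unshred => "return binary variant",
        ShreddingRequest::PreserveExisting => "pass the underlying structure through as-is",
    }
}

fn main() {
    assert_eq!(
        describe(&ShreddingRequest::PreserveExisting),
        "pass the underlying structure through as-is"
    );
    println!("ok");
}
```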
> Once the field's path is exhausted, just return whatever was there (similar to how as_type=None is implemented today); if we passed an explicit shredding schema instead, then the implementation would have to examine every node of the schema, all the way to leaf level, in case the provided shredding structure doesn't actually match the underlying.
This is sort of what I expected to happen today.
Maybe we could use the `cast_to_variant` kernel for this? For example, if I had a `VariantArray` with some arbitrary shredding, I could do:

```rust
let array: VariantArray = ...; // input somehow
// get the "foo" field (may be shredded).
let foo = variant_get(array, Path::from("foo"))?;
// ensure foo is not shredded
let unshredded_foo = cast_to_variant(foo)?;
```
🤔
The scenario I'm thinking of is a more forgiving replacement for normal JSON parsing, where the requested schema has a mix of strongly-typed and variant fields. So parse JSON to variant, and then use `variant_get` to extract the actual schema of interest. Fields that are known to have inconsistent (or legally flexible) behavior would be kept as variant, with strict typing enforcement applied only to the strongly typed fields (similar to today's JSON parsing with a schema). But in that case, we have a (potentially deeply) nested struct with multiple variant leaf fields, and fetching individual fields one at a time would be pretty annoying.
If the caller wants a specific schema -- whether binary variant or a specific shredding -- it's easy enough to pass that. The difficulty comes if the caller just wants to get back whatever flavor of variant is already there (without shredding or unshredding it first).
I don't see how `cast_to_variant` would help in that case?
> I don't see how cast_to_variant would help in that case?
I don't think it would (I misunderstood what you were getting at)
> But in that case, we have a (potentially deeply) nested struct with multiple variant leaf fields, and fetching individual fields one at a time would be pretty annoying.

> The difficulty comes if the caller just wants to get back whatever flavor of variant is already there (without shredding or unshredding it first).
I am confused -- if we are parsing JSON, then it will always be variant (not shredded), so I am not sure how there would be different flavors of variant. Maybe this is a good topic for a call or some examples
Also I wonder if maybe we should open a new ticket as this conversation is likely to get pretty lost
Not really -- it seems to be going fine. The biggest issue I hit is that currently the parquet reader will read Binary columns as Binary rather than BinaryView, which is somewhat awkward. In this PR I still need to implement writing logical types too, and then clean up the code. I hope to have time to finish it later today or tomorrow
Should we consider just adding support for Binary columns?
I think we will need to do so eventually, given the arrow spec explicitly says they are supported. However, I think we can add that support in parallel (I would like to avoid blocking logical type support on Binary support)
Force-pushed from 57c75b3 to 7f7c2ee
I need to do a little more cleanup tomorrow morning, then this should be ready to go
Changed the title: `VariantArray` to parquet with correct LogicalType → `VariantArray` to parquet with `Variant` LogicalType
Tracking with #8387
Force-pushed from 170c67a to db66c71
Update: I think this PR is now in pretty good shape (it does the Logical type annotations and has examples of reading/writing them). However, it is pretty large to request a review on. My plan is to break out the "remove `Array` impl" part into a separate PR for easier review
I can actually see the end to end story now, nice!
```rust
Some(LogicalType::Variant) => {
    arrow_field.with_extension_type(parquet_variant_compute::VariantType)
}
// TODO add other LogicalTypes here
```
Out of curiosity, how do the other parquet logical types map to arrow types? Seems like JSON, BSON, UUID, etc will face the same problem as Variant?
They are interleaved in the schema conversion code with `cfg` like this:

arrow-rs/parquet/src/arrow/schema/mod.rs (lines 405 to 412 in c2a1fca):

```rust
#[cfg(feature = "arrow_canonical_extension_types")]
if let Some(logical_type) = basic_info.logical_type() {
    match logical_type {
        LogicalType::Uuid => ret.try_with_extension_type(Uuid)?,
        LogicalType::Json => ret.try_with_extension_type(Json::default())?,
        _ => {}
    }
}
```
I plan to move the code handling the other types into the extension.rs module as a follow-on PR
```rust
#[cfg(feature = "variant_experimental")]
pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
    use parquet_variant_compute::VariantType;
    if field.extension_type_name()? == VariantType::NAME {
        return Some(LogicalType::Variant);
    };
    None
```
It seems a bit odd that this API call's existence would be gated by the variant feature?
Suggested change, from:

```rust
#[cfg(feature = "variant_experimental")]
pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
    use parquet_variant_compute::VariantType;
    if field.extension_type_name()? == VariantType::NAME {
        return Some(LogicalType::Variant);
    };
    None
```

to:

```rust
pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
    let type_name = field.extension_type_name()?;
    #[cfg(feature = "variant_experimental")]
    {
        use parquet_variant_compute::VariantType;
        if type_name == VariantType::NAME {
            return Some(LogicalType::Variant);
        }
    }
    None
```
Ah, there's a cfg-not version below.
But shouldn't we follow the example in other parts of the code, and use `try_extension_type` rather than merely checking the name? Otherwise, somebody could slap the variant annotation on some completely irrelevant struct, with no validation to catch the problem.
My rationale here (which I will add as a comment) was that creating an `ArrowError` is expensive (as it allocates a String), so doing `try_extension_type().ok()` would add an allocation to all code paths, even those that didn't have the Variant type.

I will try and make this clearer, and maybe add some more docs to the extension type code
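A minimal, self-contained sketch of that rationale. The `"ARROW:extension:name"` metadata key follows Arrow's extension type convention, but the helper functions and the constant value here are invented stand-ins (the real code uses `Field::extension_type_name` and `VariantType::NAME`):

```rust
// Sketch of the "check the cheap name first" rationale: a borrowed &str
// comparison costs nothing, while constructing an error would allocate a
// String on every non-variant field.
use std::collections::HashMap;

// Stand-in for VariantType::NAME (the real constant lives in
// parquet_variant_compute)
const VARIANT_NAME: &str = "arrow.parquet.variant";

fn extension_type_name(metadata: &HashMap<String, String>) -> Option<&str> {
    metadata.get("ARROW:extension:name").map(|s| s.as_str())
}

/// Cheap path: a plain `Option`/`&str` check, no error allocation for
/// the (common) fields that are not Variant at all.
fn is_variant(metadata: &HashMap<String, String>) -> bool {
    extension_type_name(metadata) == Some(VARIANT_NAME)
}

fn main() {
    let mut metadata = HashMap::new();
    // Non-variant field: returns false without constructing any error
    assert!(!is_variant(&metadata));
    metadata.insert("ARROW:extension:name".to_string(), VARIANT_NAME.to_string());
    assert!(is_variant(&metadata));
    println!("ok");
}
```

Full validation via `try_extension_type` could then run only on the fields whose name already matched.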
That sounds like a flaw in the API. Should it return `Option<Result<E>>` to cover the "not an extension type at all" case? Either way, people have to manually check something.

BTW, the JSON extension type handling for string columns is doing the unconditional `try_extension_type`, so maybe we should fix that one as well? (either by the API change or by adding a specific check for existence of metadata before making the `try_extension_type` call)
parquet/src/variant.rs (outdated)
```rust
//! let file = std::fs::File::open(file_path())?;
//! let mut reader = ArrowReaderBuilder::try_new(file)?.build()?;
//!
//! // You can find the Variant using the VariantType extension type
```
"find" ?
clarified in 0bea137
```rust
/// Ensure a file with Variant LogicalType, written by another writer in
/// parquet-testing, can be read as a VariantArray
#[test]
fn read_logical_type() {
```
Isn't this the ~same as the doctest above? Do we really need both?
Yes, you are right -- it is the same. I wrote this test before the doctest. I'll remove this one
I double checked, and this test also does some more stuff like testing sizes. Do you think it is worth removing?
Or, just move the additional assertions to the doc test, hidden behind `#` if users don't care to see them?
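For reference, a sketch of the `#`-hiding mechanism rustdoc provides (the function and its assertions are illustrative, not the actual test being discussed):

```rust
/// Illustrative doctest: `#`-prefixed lines still compile and run under
/// `cargo test`, but rustdoc hides them from the rendered docs -- a
/// natural home for extra assertions like the size checks mentioned above.
///
/// ```
/// let values = vec![1, 2, 3];
/// println!("{} values", values.len());
/// # // hidden from rendered docs, still executed:
/// # assert_eq!(values.len(), 3);
/// ```
fn documented_len() -> usize {
    vec![1, 2, 3].len()
}

fn main() {
    assert_eq!(documented_len(), 3);
    println!("ok");
}
```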
that is a great idea
```rust
})?;

let shredding_state =
    ShreddedVariantFieldArray::shredding_state_from_struct_array(struct_array)?;
```
Why not just `ShreddingState::from`, out of curiosity? The operation is actually infallible, it just has a fallible signature at the moment. And AFAIK there's no difference in how we create the shredding state for top-level vs. object field.
Force-pushed from 385331d to 9f3c2bb

Force-pushed from 9f3c2bb to ca74726
I broke out the first chunk of work into its own PR. Once that is merged I'll make a PR with the remaining work
This is a random cleanup I encountered while working on #8365
Which issue does this PR close?
Rationale for this change
We need a way to read/write Variant encoded data to/from parquet, and the way `VariantArray` is currently implemented doesn't work (it panics when writing to parquet)
What changes are included in this PR?
- Remove the `Array` impl for `VariantArray`, which forces explicit conversions back/forth when writing
- Read the `Variant` logical type and add the correct Arrow extension type metadata
- Write the `Variant` logical type when the correct arrow extension type metadata is present

As a follow-on PR I plan to give the same basic treatment to `ShreddedVariantStateArray`
Are these changes tested?
yes, by existing tests and new (doc) tests
Are there any user-facing changes?
Yes, but this is not yet stable / released, so these changes have no impact on the releasability of this code
`VariantArray` no longer implements `Array`