[EPIC] Support VARIANT type for unstructured data #16116

@alamb

Is your feature request related to a problem or challenge?

Processing semi-structured data (roughly, anything that can be represented in JSON) efficiently is becoming more and more important.

As @wjones127 says in https://github.com/apache/datafusion/issues/10987:

This would be a high-performance data type for semi-structured data, designed for better OLAP performance than JSON or BSON (discussed in #7845).

While it is certainly possible to implement semi-structured, JSON, and even Variant support today using the DataFusion extension APIs (e.g. https://github.com/datafusion-contrib/datafusion-functions-json), this ticket tracks adding such support to DataFusion itself.

Parquet recently adopted the Variant type: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md

Other systems, such as Iceberg and Spark, are adopting it as well.

I think Databricks did a good job describing its rationale:

Without Variant, customers had to choose between flexibility and performance. To maintain flexibility, customers would store JSON in single columns as strings. To see better performance, customers would apply strict schematizing approaches with structs, which requires separate processes to maintain and update with schema changes. With Variant, customers can retain flexibility (there's no need to define an explicit schema) and receive vastly improved performance compared to querying the JSON as a string.
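To make the trade-off in that quote concrete, here is a toy sketch (not the Parquet Variant binary encoding, and not DataFusion code). It assumes some hypothetical sample rows and helper names (`get_path`, `ingest`, and so on are made up for illustration). It contrasts re-parsing JSON text on every query with parsing once at ingest and then only navigating the parsed values, which is roughly the idea a Variant-style representation builds on.

```python
import json

# Hypothetical sample rows; the field names are illustrative only.
RAW_ROWS = [
    '{"user": {"id": 1, "name": "a"}, "tags": ["x", "y"]}',
    '{"user": {"id": 2}, "extra": 3.5}',
    '{"event": "ping"}',
]


def get_path(value, path):
    """Walk a list of object keys; return None if any step is missing."""
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def query_json_strings(rows, path):
    # JSON-as-string storage: every query re-parses every row's text.
    return [get_path(json.loads(row), path) for row in rows]


def ingest(rows):
    # Variant-like storage (toy version): parse once at write time and
    # keep a self-describing value per row, with no fixed schema.
    return [json.loads(row) for row in rows]


def query_parsed(parsed_rows, path):
    # Queries only navigate the already-parsed values.
    return [get_path(value, path) for value in parsed_rows]


parsed = ingest(RAW_ROWS)
ids_from_strings = query_json_strings(RAW_ROWS, ["user", "id"])
ids_from_parsed = query_parsed(parsed, ["user", "id"])
```

Both queries return the same answers, and rows with different shapes coexist without a declared schema. The difference is where the parsing cost is paid: per query for strings, once at ingest for the parsed form.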

Describe the solution you'd like

No response

Describe alternatives you've considered

This will be a big project. Here are some of the related prerequisites

It is not clear to me whether Variant should be "built in" or an add-on (for example, a variant feature and a datafusion-variant crate).

Additional context

Related tickets

Metadata

    Labels

    PROPOSAL EPIC: A proposal being discussed that is not yet fully underway
    enhancement: New feature or request
