Skip to content

Add native VARIANT data type support#2313

Open
thohan88 wants to merge 3 commits intoduckdb:mainfrom
thohan88:feature/variant
Open

Add native VARIANT data type support#2313
thohan88 wants to merge 3 commits intoduckdb:mainfrom
thohan88:feature/variant

Conversation

@thohan88
Copy link
Copy Markdown

Adds R-side glue code to decode DuckDB's VARIANT columns (>=1.5.0) into native R list objects instead of erroring on fetch. I have tried to keep this minimal to mainly address #2308 for supporting reads as I suspect R → VARIANT conversion is a different beast.

What this does

  • Registers VARIANT as a known type across the R type system (VECSXP / "list")
  • Decodes each VARIANT row via VariantUtils::ConvertVariantToValueValueToSexp, producing native R objects:
    • Primitives → scalar R values (logical, integer, double, character)
    • Objects → named R lists
    • Arrays → unnamed R lists
    • NULL → NULL
    • Timestamps/dates inside variants → POSIXct/Date with proper attributes
  • Extends ValueToSexp to handle LIST, MAP, STRUCT, and UNION value types so that nested VARIANT structures recursively produce the correct R representation
  • Changes ValueToSexp signature to const Value& / const string& for const-correctness
  • Uses convert_opts.timezone_out for timezone-aware conversion of timestamps inside variants

What this does NOT do

  • No vectorized fast-path: Each row materializes an intermediate Value object before converting to SEXP (Variant binary → Value → SEXP). This works correctly but creates extra allocations for large columns. A future improvement could use DuckDB's VariantVisitor<T> template to go directly from variant binary to SEXP, skipping the intermediate Value tree.
  • No R → VARIANT write support: R objects cannot be written back as VARIANT columns via dbWriteTable or parameterized queries. Users can work around this by inserting via SQL with explicit ::VARIANT casts. (Mapping arbitrary R lists to DuckDB VARIANT is non-trivial.)

Files changed (5 files, +195 / −2)

File Change
src/transform.cpp Add VARIANT case to duckdb_r_typeof, duckdb_r_decorate, and duckdb_r_transform
src/types.cpp Register VARIANT"list" in DetectLogicalType
src/utils.cpp Add LIST/MAP/STRUCT/UNION cases to ValueToSexp; const signature
src/include/typesr.hpp Update ValueToSexp declaration to const
tests/testthat/test-variant.R 8 test cases (37 expectations): primitives, empty containers, deep nesting, timestamps, dates, batched fetch, heterogeneous objects, LIST(VARIANT) composability

Testing

  • testthat::test_local(filter = "^variant$") — 37/37 pass
  • Full suite — 1265 pass, 0 new failures

Note

Implemented with the help of Claude Opus.

Copy link
Copy Markdown
Collaborator

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Can we now unlock the variant type in the test_all_types() test?

Comment thread src/utils.cpp Outdated
Comment on lines +328 to +329
case LogicalTypeId::LIST:
case LogicalTypeId::MAP: {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this code? Where is it tested? Should it be a separate PR?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this code?

The ValueToSexp function converts a DuckDB Value to an R SEXP. Before this PR, it only handled scalar types (BOOLEAN, INTEGER, DOUBLE, VARCHAR, TIMESTAMP, etc.) and fell through to throw NotImplementedException for compound types like LIST, MAP, STRUCT, and UNION.

These cases are needed because the VARIANT transform path works like this:

  1. In transform.cpp:672 VariantUtils::ConvertVariantToValue(...) decodes the shredded binary into a Value
  2. That Value is then passed to ValueToSexp(val, ...) to convert it to R

A VARIANT can contain any DuckDB type, including lists, structs, maps, and unions. Without these new cases, decoding {'a': 1, 'b': [2, 3]}::VARIANT would crash at ValueToSexp when it hits the struct Value containing a nested list Value.

Where is it tested:

By several tests in test-variant.R:

  • Line 13: {'a': 1, 'b': [2, 3]}::VARIANT: exercises STRUCT → ValueToSexp → LIST → ValueToSexp recursion
  • Line 14: [4, 5, 6]::VARIANT: exercises LIST case
  • Line 49: list_value()::VARIANT: empty list
  • Line 54: MAP()::VARIANT: MAP case
  • Line 62: deeply nested struct/list
  • Line 123: heterogeneous struct

Should it be a separate PR?

Maybe. These cases only exist to support the VARIANT feature. Before VARIANT, ValueToSexp was only called from register.cpp:168 for Arrow filter pushdown with scalar constant values (where LIST, MAP and STRUCT constants are extremely unlikely). IMO, they are probably not independently useful enough to justify a standalone PR. Splitting them out would mean this PR would not compile on its own, but I am happy to make a stacked PR if you prefer that.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, we already convert structs and maps from DuckDB to R. I wonder if we could reuse that code path instead, or somehow combine these code paths. (Without looking at it too closely, should the "struct" case return a data frame instead of a bare list?)

Is this an implication of the first non-goal? If the advantage of going straight from binary to data is code reuse, I'd prefer that.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first non-goal is about avoiding the intermediate Value allocation by using VariantVisitor<T> to go straight from binary to SEXP to improve performance.

It would not enable reuse of the existing transform.cpp code though, because the two paths operate on different data representations:

  • Existing LIST, MAP, STRUCT code in transform.cpp works on columnar Vector objects (e.g. ListVector::GetData, StructVector::GetEntries),
  • Upstream VariantUtils::ConvertVariantToValue returns a scalar Value per row.

Reusing duckdb_r_transform from a Value would mean wrapping each one in a single-element Vector and going through the full allocate/decorate/transform pipeline per cell, which is doable but adds per-cell overhead compared to the direct recursive conversion.

Good catch that the struct case should return a data frame (not a bare list) and the map case should return a key/value data frame matching existing behavior in transform.cpp. I can update ValueToSexp to add the proper duckdb_r_df_decorate calls for consistency, but fully understand if you would like to explore a different direction for code reuse.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see how code reuse is hard here, and perhaps not even attractive. Does the following property always hold: a column returned from DuckDB wrapped into a list (element-wise) gives the same result as coercing that column to a variant and returning from DuckDB? What are the exceptions?

We could then use test_all_types() to check for this property, this ensures consistency moving forward.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After attempting a couple of alternatives, I think reusing duckdb_r_transform is a viable option.

I have refactored ValueToSexp to wrap individual values in a single length Vector and funnel them through the existing transformation pipeline (duckdb_r_allocate -> duckdb_r_decorate -> duckdb_r_transform). Doing so achieves structural parity, code reuse and respect for connection-level settings (e.g. bigint and timezone_out). We also get rid of the existing manual type switch logic in utils.cpp which was only used for arrow filter pushdown.

It introduces a small per-cell overhead by wrapping values in a single-element Vector, but I believe the gains make it worth the trade-off. I do not think scalar conversion logic will be the bottleneck for VARIANT data anyway.

All tests passed and confirmed types match when round-tripped through VARIANT.

@thohan88
Copy link
Copy Markdown
Author

Thanks. Can we now unlock the variant type in the test_all_types() test?

I do not see any entries for variant columns in the vendored test_all_types.cpp yet. Once it is added upstream, I would assume it would work out of the box with this transform logic.

@thohan88
Copy link
Copy Markdown
Author

I have updated the branch and squashed the history to suggest a cleaner implementation which ensures type parity and makes use of existing code. Polite disclaimer that C++ is not my primary language, but I have attempted to follow the existing style with some assistance from Claude Opus.

See comment: #2313 (comment)

After attempting a couple of alternatives, I think reusing duckdb_r_transform is a viable option.

I have refactored ValueToSexp to wrap individual values in a single length Vector and funnel them through the existing transformation pipeline (duckdb_r_allocate -> duckdb_r_decorate -> duckdb_r_transform). Doing so achieves structural parity, code reuse and respect for connection-level settings (e.g. bigint and timezone_out). We also get rid of the existing manual type switch logic in utils.cpp which was only used for arrow filter pushdown.

It introduces a small per-cell overhead by wrapping values in a single-element Vector, but I believe the gains make it worth the trade-off. I do not think scalar conversion logic will be the bottleneck for VARIANT data anyway.

All tests passed and confirmed types match when round-tripped through VARIANT.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.90909% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.44%. Comparing base (5c88c6f) to head (9bff756).
⚠️ Report is 3561 commits behind head on main.

Files with missing lines Patch % Lines
src/utils.cpp 84.21% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2313       +/-   ##
===========================================
- Coverage   63.61%   51.44%   -12.18%     
===========================================
  Files         123      154       +31     
  Lines        5266    10377     +5111     
===========================================
+ Hits         3350     5338     +1988     
- Misses       1916     5039     +3123     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants