Add native VARIANT data type support#2313
Conversation
krlmlr
left a comment
There was a problem hiding this comment.
Thanks. Can we now unlock the variant type in the test_all_types() test?
| case LogicalTypeId::LIST: | ||
| case LogicalTypeId::MAP: { |
There was a problem hiding this comment.
What's the purpose of this code? Where is it tested? Should it be a separate PR?
There was a problem hiding this comment.
What's the purpose of this code?
The ValueToSexp function converts a DuckDB Value to an R SEXP. Before this PR, it only handled scalar types (BOOLEAN, INTEGER, DOUBLE, VARCHAR, TIMESTAMP, etc.) and fell through to throw NotImplementedException for compound types like LIST, MAP, STRUCT, and UNION.
These cases are needed because the VARIANT transform path works like this:
- In transform.cpp:672
VariantUtils::ConvertVariantToValue(...)decodes the shredded binary into a Value - That Value is then passed to
ValueToSexp(val, ...)to convert it to R
A VARIANT can contain any DuckDB type, including lists, structs, maps, and unions. Without these new cases, decoding {'a': 1, 'b': [2, 3]}::VARIANT would crash at ValueToSexp when it hits the struct Value containing a nested list Value.
Where is it tested:
By several tests in test-variant.R:
- Line 13:
{'a': 1, 'b': [2, 3]}::VARIANT: exercises STRUCT → ValueToSexp → LIST → ValueToSexp recursion - Line 14:
[4, 5, 6]::VARIANT: exercises LIST case - Line 49:
list_value()::VARIANT: empty list - Line 54:
MAP()::VARIANT: MAP case - Line 62: deeply nested struct/list
- Line 123: heterogeneous struct
Should it be a separate PR?
Maybe. These cases only exist to support the VARIANT feature. Before VARIANT, ValueToSexp was only called from register.cpp:168 for Arrow filter pushdown with scalar constant values (where LIST, MAP and STRUCT constants are extremely unlikely). IMO, they are probably not independently useful enough to justify a standalone PR. Splitting them out would mean this PR would not compile on its own, but I am happy to make a stacked PR if you prefer that.
There was a problem hiding this comment.
IIRC, we already convert structs and maps from DuckDB to R. I wonder if we could reuse that code path instead, or somehow combine these code paths. (Without looking at it too closely, should the "struct" case return a data frame instead of a bare list?)
Is this an implication of the first non-goal? If the advantage of going straight from binary to data is code reuse, I'd prefer that.
There was a problem hiding this comment.
The first non-goal is about avoiding the intermediate Value allocation by using VariantVisitor<T> to go straight from binary to SEXP to improve performance.
It would not enable reuse of the existing transform.cpp code though, because the two paths operate on different data representations:
- Existing LIST, MAP, STRUCT code in transform.cpp works on columnar
Vectorobjects (e.g.ListVector::GetData,StructVector::GetEntries), - Upstream
VariantUtils::ConvertVariantToValuereturns a scalarValueper row.
Reusing duckdb_r_transform from a Value would mean wrapping each one in a single-element Vector and going through the full allocate/decorate/transform pipeline per cell, which is doable but adds per-cell overhead compared to the direct recursive conversion.
Good catch that the struct case should return a data frame (not a bare list) and the map case should return a key/value data frame matching existing behavior in transform.cpp. I can update ValueToSexp to add the proper duckdb_r_df_decorate calls for consistency, but fully understand if you would like to explore a different direction for code reuse.
There was a problem hiding this comment.
I see how code reuse is hard here, and perhaps not even attractive. Does the following property always hold: a column returned from DuckDB wrapped into a list (element-wise) gives the same result as coercing that column to a variant and returning from DuckDB? What are the exceptions?
We could then use test_all_types() to check for this property, this ensures consistency moving forward.
There was a problem hiding this comment.
After attempting a couple of alternatives, I think reusing duckdb_r_transform is a viable option.
I have refactored ValueToSexp to wrap individual values in a single length Vector and funnel them through the existing transformation pipeline (duckdb_r_allocate -> duckdb_r_decorate -> duckdb_r_transform). Doing so achieves structural parity, code reuse and respect for connection-level settings (e.g. bigint and timezone_out). We also get rid of the existing manual type switch logic in utils.cpp which was only used for arrow filter pushdown.
It introduces a small per-cell overhead by wrapping values in a single-element Vector, but I believe the gains make it worth the trade-off. I do not think scalar conversion logic will be the bottleneck for VARIANT data anyway.
All tests passed and confirmed types match when round-tripped through VARIANT.
I do not see any entries for variant columns in the vendored test_all_types.cpp yet. Once it is added upstream, I would assume it would work out of the box with this transform logic. |
8d11872 to
51c2672
Compare
|
I have updated the branch and squashed the history to suggest a cleaner implementation which ensures type parity and makes use of existing code. Polite disclaimer that C++ is not my primary language, but I have attempted to follow the existing style with some assistance from Claude Opus. See comment: #2313 (comment)
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2313 +/- ##
===========================================
- Coverage 63.61% 51.44% -12.18%
===========================================
Files 123 154 +31
Lines 5266 10377 +5111
===========================================
+ Hits 3350 5338 +1988
- Misses 1916 5039 +3123 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Adds R-side glue code to decode DuckDB's
VARIANTcolumns (>=1.5.0) into native R list objects instead of erroring on fetch. I have tried to keep this minimal to mainly address #2308 for supporting reads as I suspect R → VARIANT conversion is a different beast.What this does
VARIANTas a known type across the R type system (VECSXP/"list")VariantUtils::ConvertVariantToValue→ValueToSexp, producing native R objects:logical,integer,double,character)NULLPOSIXct/Datewith proper attributesValueToSexpto handleLIST,MAP,STRUCT, andUNIONvalue types so that nested VARIANT structures recursively produce the correct R representationValueToSexpsignature toconst Value&/const string&for const-correctnessconvert_opts.timezone_outfor timezone-aware conversion of timestamps inside variantsWhat this does NOT do
Valueobject before converting to SEXP (Variant binary → Value → SEXP). This works correctly but creates extra allocations for large columns. A future improvement could use DuckDB'sVariantVisitor<T>template to go directly from variant binary to SEXP, skipping the intermediateValuetree.dbWriteTableor parameterized queries. Users can work around this by inserting via SQL with explicit::VARIANTcasts. (Mapping arbitrary R lists to DuckDB VARIANT is non-trivial.)Files changed (5 files, +195 / −2)
src/transform.cppVARIANTcase toduckdb_r_typeof,duckdb_r_decorate, andduckdb_r_transformsrc/types.cppVARIANT→"list"inDetectLogicalTypesrc/utils.cppLIST/MAP/STRUCT/UNIONcases toValueToSexp;constsignaturesrc/include/typesr.hppValueToSexpdeclaration toconsttests/testthat/test-variant.RLIST(VARIANT)composabilityTesting
testthat::test_local(filter = "^variant$")— 37/37 passNote
Implemented with the help of Claude Opus.