feat(compat): coerce Arrow types DuckDB cannot consume (Float16 -> Float32)#195
Merged
…dary
Lance can emit Arrow primitive types DuckDB has no bundled mapping for —
today FloatingPoint(HALF), which surfaces as
[UNSUPPORTED_ARROWTYPE] Unsupported arrow type FloatingPoint(HALF).
at the first SELECT.
Adds a thin reader-boundary compat layer:
* lance_arrow_compat.{hpp,cpp} — a CoercionRule registry. Today's single
rule is Float16 -> Float32 (format "e" -> "f"; IEEE 754 widen that
matches lance-spark's Float16Utils.halfToFloat bit-for-bit). Future
unsupported primitives slot in as one row in kRules[].
* LanceCoerceArrowSchemaForDuckDB walks ArrowSchema (including children
and dictionary), swaps format pointers to the coerced format via a
wrapping release callback so the producer's release still works, and
returns top-level column names that were coerced.
* LanceCoerceArrowArrayForDuckDB mirrors on the per-batch ArrowArray,
allocating new buffers and installing a wrapping release.
Wired at every Lance -> DuckDB boundary: scan / exec-IR / KNN / FTS binds,
batch loads, deferred-take, MERGE take, catalog schema generators.
Write-path guard: coercion is destructive on the write side — DuckDB would
round-trip FLOAT back to Lance and silently widen the HALF storage. The
catalog entry now carries the list of coerced columns; INSERT / UPDATE /
MERGE throw NotImplementedException naming the affected columns.
Adds sqllogic test covering scalar / FixedSizeList / variable List /
Struct-with-HALF / List-of-Struct-with-HALF / Map-to-HALF, plus the
write guards, and a pylance fixture generator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
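For reviewers who want the Float16 -> Float32 widen spelled out, here is a minimal sketch of an IEEE 754 binary16 to binary32 conversion plus the element-wise buffer copy the array-side helper is described as doing. The names `HalfBitsToFloat` and `WidenHalfBuffer` are hypothetical, and nothing here is copied from the PR or from lance-spark; it only illustrates the technique under the assumption that half values arrive as raw 16-bit words.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not the PR's HalfToFloat): widen one binary16 bit
// pattern to a binary32 float. The conversion is exact for every input.
inline float HalfBitsToFloat(uint16_t h) {
    const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    const uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0x1Fu) {
        // Inf / NaN: all-ones exponent, payload moved to the top mantissa bits.
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {
        // Normal number: rebias the exponent from 15 to 127 (difference 112).
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant == 0) {
        // Signed zero.
        bits = sign;
    } else {
        // Subnormal half: shift until the implicit leading 1 appears, then
        // drop it. A half subnormal is always a normal float.
        uint32_t shift = 0;
        while ((mant & 0x400u) == 0) {
            mant <<= 1;
            ++shift;
        }
        bits = sign | ((113u - shift) << 23) | ((mant & 0x3FFu) << 13);
    }
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Hypothetical helper: allocate a new float buffer and convert element-wise,
// leaving the producer's half buffer untouched.
std::vector<float> WidenHalfBuffer(const uint16_t *src, std::size_t n) {
    std::vector<float> dst(n);
    for (std::size_t i = 0; i < n; i++) {
        dst[i] = HalfBitsToFloat(src[i]);
    }
    return dst;
}
```

For example, the bit pattern `0x3C00` widens to `1.0f` and `0x7BFF` (the largest finite half) widens to `65504.0f`. Because every half value is exactly representable as a float, the read side loses nothing; the write guard exists because the reverse direction is not free.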
Three CI fixes for PR lance-format#195:
* make format-fix: clang-format 11 on the three new C++ files and black==24 on the fixture generator.
* Commit the generated test/data/float16_fixture.lance so the sqllogic test has an ATTACH target in CI.
* Drop the Map<Utf8, Float16> case: Lance's Rust writer currently errors with "not yet implemented: Implement encoding for field half_map (halffloat)" so the shape cannot be round-tripped through a fixture. Schema-side Map<_, Float16> coercion is still exercised in the lance-spark LanceArrowUtilsSuite on the JVM side.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DuckDB's `ClientContextState::registered_state` is an `unordered_map`, so the iteration order at profile-time is hash-dependent: MSVC STL prints "Lance Dataset Cache" before "Lance Session Cache", while libstdc++ and libc++ print them the other way. The prior regex locked in one order and blew up on Windows. Accept either ordering via alternation. Note: PR lance-format#188 (which introduced this test) was never exercised on Windows because the distribution pipeline only triggers on CMakeLists.txt / vcpkg.json changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
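The either-order match described in this commit can be illustrated with a small `std::regex` sketch. The function name `MentionsBothCaches` and the shape of the profile text are assumptions for illustration only; the real fix lives in the sqllogic test's regex, whose exact pattern is not shown in this PR text.

```cpp
#include <regex>
#include <string>

// Hypothetical check: true when both cache names appear in the profile
// output, regardless of which one the unordered_map iteration printed first.
bool MentionsBothCaches(const std::string &profile) {
    // Alternation over the two possible orderings; [\s\S]* spans newlines.
    static const std::regex either_order(
        "(Lance Dataset Cache[\\s\\S]*Lance Session Cache)|"
        "(Lance Session Cache[\\s\\S]*Lance Dataset Cache)");
    return std::regex_search(profile, either_order);
}
```

The alternation keeps the assertion that both entries are present while dropping the accidental dependence on hash iteration order.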
Upstream Lance panics in conflict_resolver.rs:1659 when a concurrently-VACUUM'd manifest is needed by a peer commit; this has been reproduced only on Windows runners. Gate the test with `require notwindows` until the upstream race is fixed. Linux/macOS coverage is unchanged. This test has never been exercised on Windows before because the Distribution workflow only triggers on CMakeLists.txt / vcpkg.json changes, and PRs lance-format#188/lance-format#189 (which added this test) didn't touch either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Xuanwo
requested changes
May 7, 2026
Xuanwo
left a comment
Collaborator
LanceTableEntry::Copy() should preserve coerced_column_names (see src/lance_scan.cpp, around the copied entry construction). The new INSERT/UPDATE/MERGE guards rely on HasCoercedColumns(), so a copied catalog entry can expose the widened FLOAT schema while losing the write-protection state. Please copy the coerced column list into the returned entry.
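The failure mode raised in this review comment can be sketched in isolation. `TableEntrySketch` and `GuardWrite` are hypothetical stand-ins, not the real `LanceTableEntry` or planner functions, and `std::logic_error` stands in for DuckDB's `NotImplementedException`; the point is only that the guard is keyed off state that `Copy()` has to carry along.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the catalog entry; only the relevant fields.
struct TableEntrySketch {
    std::string name;
    std::vector<std::string> coerced_column_names;

    bool HasCoercedColumns() const {
        return !coerced_column_names.empty();
    }

    // The copy must carry the coerced list. If it were dropped here, the
    // copied entry would still expose FLOAT columns but pass the write guard.
    TableEntrySketch Copy() const {
        TableEntrySketch out;
        out.name = name;
        out.coerced_column_names = coerced_column_names;
        return out;
    }
};

// Hypothetical write guard: refuse the operation and name the columns.
// std::logic_error is a placeholder for NotImplementedException.
void GuardWrite(const TableEntrySketch &entry, const std::string &op) {
    if (!entry.HasCoercedColumns()) {
        return;
    }
    std::string cols;
    for (const auto &col : entry.coerced_column_names) {
        if (!cols.empty()) {
            cols += ", ";
        }
        cols += col;
    }
    throw std::logic_error(op + " is not supported on table '" + entry.name +
                           "' with coerced columns: " + cols);
}
```

With the list preserved, `GuardWrite(entry.Copy(), "INSERT")` throws exactly when `GuardWrite(entry, "INSERT")` does.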
Without this, a copied catalog entry exposes the widened FLOAT schema but HasCoercedColumns() returns false, silently dropping the INSERT/UPDATE/MERGE write protection and risking on-disk corruption. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
@Xuanwo I have made the changes as requested, pls take a look again.
Summary
Teach the Lance → DuckDB reader boundary to transparently coerce Arrow primitives DuckDB has no mapping for, starting with Float16 (FloatingPoint(HALF)).
Before this PR, any `SELECT` against a Lance dataset with a `float16` column (scalar, `fixed_size_list<float16>`, or nested inside struct/list/map) fails at bind time with `[UNSUPPORTED_ARROWTYPE] Unsupported arrow type FloatingPoint(HALF).`
DuckDB's bundled Arrow primitive decoder (`arrow_duck_schema.cpp::GetTypeFromFormat`) is hardcoded and has no plugin API for adding primitive formats. `ArrowTypeExtension` is only for named Arrow extension types (`ARROW:extension:name`), not primitive format codes. So a reader-boundary shim on our side is the right layer.
Design
New `src/lance_arrow_compat.{hpp,cpp}` with a tiny rule registry; adding a future unsupported primitive is one row in `kRules[]`.
Two umbrella helpers walk schema + array (including `children[]` and `dictionary`):
* `LanceCoerceArrowSchemaForDuckDB(ArrowSchema *)`: rewrites format pointers via a wrapping release so the producer's release still frees the original string. Returns top-level column names that were coerced.
* `LanceCoerceArrowArrayForDuckDB(const ArrowSchema *, ArrowArray *)`: allocates new buffers, converts element-wise (`HalfToFloat` matches lance-spark's `Float16Utils.halfToFloat` bit-for-bit), installs a wrapping release.
Wired at every Lance → DuckDB boundary: scan / exec-IR / KNN / FTS binds, batch loads, deferred-take, MERGE take, catalog schema generators.
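The schema-side walk described in this section can be sketched roughly as follows. This is a simplified illustration, not the PR's code: the `ArrowSchema` struct is redeclared locally with the Arrow C Data Interface layout so the snippet is self-contained, `SavedState`, `WrappedRelease`, `CoerceNode`, and `CoerceSchemaSketch` are hypothetical names, and the root is assumed to be a struct schema whose children are the table columns.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Local redeclaration of the Arrow C Data Interface schema struct.
struct ArrowSchema {
    const char *format;
    const char *name;
    const char *metadata;
    int64_t flags;
    int64_t n_children;
    ArrowSchema **children;
    ArrowSchema *dictionary;
    void (*release)(ArrowSchema *);
    void *private_data;
};

// One row per unsupported primitive: producer format -> coerced format.
struct CoercionRule {
    const char *from;
    const char *to;
};
static const CoercionRule kRules[] = {
    {"e", "f"}, // Float16 -> Float32
};

// Producer state stashed while the node is wrapped (hypothetical helper type).
struct SavedState {
    const char *format;
    void (*release)(ArrowSchema *);
    void *private_data;
};

// Restore the producer's fields, then let the producer free what it owns.
static void WrappedRelease(ArrowSchema *schema) {
    auto *saved = static_cast<SavedState *>(schema->private_data);
    schema->format = saved->format;
    schema->release = saved->release;
    schema->private_data = saved->private_data;
    delete saved;
    if (schema->release) {
        schema->release(schema);
    }
}

// Rewrites this node if a rule matches, then recurses into children and
// dictionary. Returns true if anything in the subtree was coerced.
static bool CoerceNode(ArrowSchema *schema) {
    if (!schema || !schema->release) {
        return false;
    }
    bool changed = false;
    for (const auto &rule : kRules) {
        if (std::strcmp(schema->format, rule.from) == 0) {
            schema->private_data =
                new SavedState{schema->format, schema->release, schema->private_data};
            schema->format = rule.to; // static string, never freed
            schema->release = WrappedRelease;
            changed = true;
            break;
        }
    }
    for (int64_t i = 0; i < schema->n_children; i++) {
        changed |= CoerceNode(schema->children[i]);
    }
    changed |= CoerceNode(schema->dictionary);
    return changed;
}

// Assumes a struct-typed root; returns names of top-level columns touched.
std::vector<std::string> CoerceSchemaSketch(ArrowSchema *root) {
    std::vector<std::string> coerced;
    for (int64_t i = 0; i < root->n_children; i++) {
        ArrowSchema *col = root->children[i];
        if (CoerceNode(col)) {
            coerced.emplace_back(col->name ? col->name : "");
        }
    }
    return coerced;
}
```

Swapping in a static format string and restoring the original pointer inside the wrapper means the producer's release never sees memory it did not allocate, which is the property the wrapping release is there to protect.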
Write-path guard
The coercion is intentionally asymmetric: if we let INSERT / UPDATE / MERGE flow, DuckDB would hand back `FLOAT` and we'd silently widen the on-disk `HALF` storage. Instead:
* `LanceTableEntry` now carries a `vector<string> coerced_column_names` (populated when the catalog bind sees any rule fire).
* `PlanLanceInsertAppend`, `PlanLanceUpdateOverwrite`, `PlanLanceMergeInto` throw `NotImplementedException` naming the affected columns, so there is no silent storage corruption.
Tests
`test/sql/float16_widening.test` exercises every recursion branch against a pylance fixture:
* `float16`
* `fixed_size_list<float16, 3>`
* `list<float16>`
* `struct<float16, float32>` + struct field projection
* `list<struct<float16>>` (struct-in-list)
* `map<varchar, float16>`
Fixture generator at `test/scripts/gen_float16_fixture.py` (pylance + pyarrow).
Related
* … `float[]` or `double[]` #117) addressed `list<float>` vs `fixed_size_list<float>` shape mismatches, not unsupported primitives.
* … `Float16Utils.halfToFloat` widening (commit `821a436`), so cross-engine reads surface identical values.
Test plan
* `test/scripts/gen_float16_fixture.py` (requires pylance + pyarrow) to generate `test/data/float16_fixture.lance` and commit it, so CI picks up the new sqllogic test.
* `make test` (or equivalent) to run the new `float16_widening.test` locally.
* `select raw_video_index from lance.all_video_shot limit 5` against a real Float16-carrying dataset and confirm it now returns rows.
* `INSERT INTO <table-with-half-col>` and confirm the clear `NotImplementedException` fires (no silent corruption).
🤖 Generated with Claude Code