feat(compat): coerce Arrow types DuckDB cannot consume (Float16 -> Float32)#195

Merged
Xuanwo merged 6 commits into lance-format:main from jiaoew1991:fix/half-float-arrow-type
May 8, 2026

@jiaoew1991

Contributor

Summary

Teach the Lance → DuckDB reader boundary to transparently coerce Arrow primitives DuckDB has no mapping for, starting with Float16 (FloatingPoint(HALF)).

Before this PR, any SELECT against a Lance dataset with a float16 column — scalar, fixed_size_list<float16>, or nested inside struct/list/map — fails at bind time:

[UNSUPPORTED_ARROWTYPE] Unsupported arrow type FloatingPoint(HALF).

DuckDB's bundled Arrow primitive decoder (arrow_duck_schema.cpp::GetTypeFromFormat) is hardcoded and has no plugin API for adding primitive formats. ArrowTypeExtension is only for named Arrow extension types (ARROW:extension:name), not primitive format codes. So a reader-boundary shim on our side is the right layer.

Design

New src/lance_arrow_compat.{hpp,cpp} with a tiny rule registry — adding a future unsupported primitive is one row:

struct CoercionRule {
  bool (*matches)(const char *format);
  const char *coerced_format;
  size_t src_element_size;
  size_t coerced_element_size;
  void (*convert)(const void *src, void *dst, int64_t count);
};

constexpr CoercionRule kRules[] = {
    {MatchesFloat16, kFloat32FormatLiteral, sizeof(uint16_t), sizeof(float),
     ConvertFloat16ToFloat32},
    // Future types: add one row here. Call sites don't change.
};
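
To illustrate how such a registry might be consulted, here is a minimal, self-contained sketch. The bodies of `MatchesFloat16` and `kFloat32FormatLiteral` are assumptions (the PR names them but does not show them), `ConvertPlaceholder` is a no-op stand-in for the real converter, and `FindCoercionRule` is a hypothetical lookup helper that does not appear in the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct CoercionRule {
  bool (*matches)(const char *format);
  const char *coerced_format;
  std::size_t src_element_size;
  std::size_t coerced_element_size;
  void (*convert)(const void *src, void *dst, int64_t count);
};

// Assumed body: "e" is the Arrow C Data Interface format string for float16.
static bool MatchesFloat16(const char *format) {
  return format != nullptr && std::strcmp(format, "e") == 0;
}

// Assumed value: "f" is the Arrow format string for float32.
static constexpr const char kFloat32FormatLiteral[] = "f";

// No-op placeholder so the sketch compiles; the real rule converts elements.
static void ConvertPlaceholder(const void *, void *, int64_t) {}

static constexpr CoercionRule kRules[] = {
    {MatchesFloat16, kFloat32FormatLiteral, sizeof(uint16_t), sizeof(float),
     ConvertPlaceholder},
};

// Hypothetical helper: returns the first rule whose matcher accepts `format`,
// or nullptr when DuckDB can consume the type as-is.
static const CoercionRule *FindCoercionRule(const char *format) {
  for (const auto &rule : kRules) {
    if (rule.matches(format)) {
      return &rule;
    }
  }
  return nullptr;
}
```

With a lookup of this shape, call sites only ask whether a rule applies, which is why adding a row would not require touching them.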

Two umbrella helpers walk schema + array (including children[] and dictionary):

  • LanceCoerceArrowSchemaForDuckDB(ArrowSchema *) — rewrites format pointers via a wrapping release so the producer's release still frees the original string. Returns top-level column names that were coerced.
  • LanceCoerceArrowArrayForDuckDB(const ArrowSchema *, ArrowArray *) — allocates new buffers, converts element-wise (HalfToFloat matches lance-spark's Float16Utils.halfToFloat bit-for-bit), installs a wrapping release.
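
The element-wise widening can be sketched as follows. This is a standard IEEE 754 binary16 to binary32 conversion written for illustration; it is not copied from the PR or from lance-spark, so the function bodies below should be read as an assumption about what such a converter looks like.

```cpp
#include <cstdint>
#include <cstring>

// Widen one IEEE 754 half (given as raw bits) to a float.
static float HalfToFloat(uint16_t half) {
  const uint32_t sign = static_cast<uint32_t>(half & 0x8000u) << 16;
  const uint32_t exponent = (half >> 10) & 0x1Fu;
  uint32_t mantissa = half & 0x3FFu;
  uint32_t bits;
  if (exponent == 0x1Fu) {
    // Infinity or NaN: all-ones exponent, mantissa payload preserved.
    bits = sign | 0x7F800000u | (mantissa << 13);
  } else if (exponent != 0) {
    // Normal number: rebias the exponent from 15 to 127.
    bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
  } else if (mantissa == 0) {
    // Signed zero.
    bits = sign;
  } else {
    // Subnormal half: normalize it, since every half subnormal is a
    // normal float.
    uint32_t shift = 0;
    while ((mantissa & 0x400u) == 0) {
      mantissa <<= 1;
      ++shift;
    }
    mantissa &= 0x3FFu;
    bits = sign | ((113u - shift) << 23) | (mantissa << 13);
  }
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// Same signature as CoercionRule::convert: widen `count` halves into floats.
static void ConvertFloat16ToFloat32(const void *src, void *dst, int64_t count) {
  const auto *in = static_cast<const uint16_t *>(src);
  auto *out = static_cast<float *>(dst);
  for (int64_t i = 0; i < count; ++i) {
    out[i] = HalfToFloat(in[i]);
  }
}
```

Because every half value is exactly representable as a float, the widening is lossless on the read side.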

Wired at every Lance → DuckDB boundary: scan / exec-IR / KNN / FTS binds, batch loads, deferred-take, MERGE take, catalog schema generators.
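
The "wrapping release" idea used by the schema helper can be sketched like this. The `ArrowSchema` layout follows the Arrow C Data Interface; `WrappedSchemaState`, `WrappedRelease`, and `CoerceSchemaSketch` are hypothetical names for illustration only and are not the PR's implementation, which also collects the coerced top-level column names.

```cpp
#include <cstdint>
#include <cstring>

// Struct layout as defined by the Arrow C Data Interface.
struct ArrowSchema {
  const char *format;
  const char *name;
  const char *metadata;
  int64_t flags;
  int64_t n_children;
  ArrowSchema **children;
  ArrowSchema *dictionary;
  void (*release)(ArrowSchema *);
  void *private_data;
};

// Hypothetical holder for the producer's original fields.
struct WrappedSchemaState {
  const char *original_format;
  void (*original_release)(ArrowSchema *);
  void *original_private_data;
};

// Restores the producer's fields, then lets the producer free its own data.
static void WrappedRelease(ArrowSchema *schema) {
  auto *state = static_cast<WrappedSchemaState *>(schema->private_data);
  schema->format = state->original_format;
  schema->release = state->original_release;
  schema->private_data = state->original_private_data;
  delete state;
  if (schema->release != nullptr) {
    schema->release(schema);
  }
}

// Recursively swaps float16 ("e") for float32 ("f") in the schema tree.
// Returns true if any node was rewritten.
static bool CoerceSchemaSketch(ArrowSchema *schema) {
  if (schema == nullptr || schema->release == nullptr) {
    return false;
  }
  bool changed = false;
  for (int64_t i = 0; i < schema->n_children; ++i) {
    if (CoerceSchemaSketch(schema->children[i])) {
      changed = true;
    }
  }
  if (CoerceSchemaSketch(schema->dictionary)) {
    changed = true;
  }
  if (std::strcmp(schema->format, "e") == 0) {
    auto *state = new WrappedSchemaState{schema->format, schema->release,
                                         schema->private_data};
    schema->format = "f";  // string literal, so it outlives the schema
    schema->release = WrappedRelease;
    schema->private_data = state;
    changed = true;
  }
  return changed;
}
```

Restoring the original `format` pointer before calling the producer's release matters because the producer may free that string itself.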

Write-path guard

The coercion is intentionally asymmetric: reads are widened, writes are rejected. If INSERT / UPDATE / MERGE were allowed through, DuckDB would hand back FLOAT values and we would silently widen the on-disk HALF storage. Instead:

  • LanceTableEntry now carries a vector<string> coerced_column_names (populated when the catalog bind sees any rule fire).
  • PlanLanceInsertAppend, PlanLanceUpdateOverwrite, PlanLanceMergeInto throw NotImplementedException naming the affected columns, so there is no silent storage corruption. Example message:

INSERT into Lance table 'videos' is not supported: column(s) [raw_video_index]
have Arrow types DuckDB cannot represent natively, so the catalog exposes a
coerced type. Writing in the coerced type would silently change the on-disk
storage.
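
A guard of this kind could look roughly like the sketch below. `ThrowIfCoercedColumns` is a hypothetical helper, `std::runtime_error` stands in for DuckDB's `NotImplementedException` so the example stays self-contained, and the message wording is abbreviated from the error shown above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: refuse a write when the table has coerced columns.
// `operation` is e.g. "INSERT"; `coerced` is the catalog entry's column list.
static void ThrowIfCoercedColumns(const std::string &operation,
                                  const std::string &table,
                                  const std::vector<std::string> &coerced) {
  if (coerced.empty()) {
    return;  // nothing was coerced, so the write is safe to plan
  }
  std::string columns;
  for (std::size_t i = 0; i < coerced.size(); ++i) {
    if (i > 0) {
      columns += ", ";
    }
    columns += coerced[i];
  }
  // std::runtime_error is a stand-in for DuckDB's NotImplementedException.
  throw std::runtime_error(
      operation + " into Lance table '" + table +
      "' is not supported: column(s) [" + columns +
      "] have Arrow types DuckDB cannot represent natively");
}
```

Calling the check at plan time, before any data is written, is what keeps the failure loud rather than silent.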

Tests

test/sql/float16_widening.test exercises every recursion branch against a pylance fixture:

  • scalar float16
  • fixed_size_list<float16, 3>
  • variable-size list<float16>
  • struct<float16, float32> + struct field projection
  • list<struct<float16>> (struct-in-list)
  • map<varchar, float16>
  • pushdown filter + projection-only scan
  • write guards (INSERT / UPDATE error assertions)

Fixture generator at test/scripts/gen_float16_fixture.py (pylance + pyarrow).

Related

Test plan

  • Run test/scripts/gen_float16_fixture.py (requires pylance + pyarrow) to generate test/data/float16_fixture.lance and commit it, so CI picks up the new sqllogic test.
  • make test (or equivalent) to run the new float16_widening.test locally.
  • Reproduce the original select raw_video_index from lance.all_video_shot limit 5 against a real Float16-carrying dataset and confirm it now returns rows.
  • Try INSERT INTO <table-with-half-col> and confirm the clear NotImplementedException fires (no silent corruption).

🤖 Generated with Claude Code

jiaoew1991 and others added 3 commits April 20, 2026 16:24
…dary

Lance can emit Arrow primitive types DuckDB has no bundled mapping for —
today FloatingPoint(HALF), which surfaces as
  [UNSUPPORTED_ARROWTYPE] Unsupported arrow type FloatingPoint(HALF).
at the first SELECT.

Adds a thin reader-boundary compat layer:

  * lance_arrow_compat.{hpp,cpp} — a CoercionRule registry. Today's single
    rule is Float16 -> Float32 (format "e" -> "f"; IEEE 754 widen that
    matches lance-spark's Float16Utils.halfToFloat bit-for-bit). Future
    unsupported primitives slot in as one row in kRules[].
  * LanceCoerceArrowSchemaForDuckDB walks ArrowSchema (including children
    and dictionary), swaps format pointers to the coerced format via a
    wrapping release callback so the producer's release still works, and
    returns top-level column names that were coerced.
  * LanceCoerceArrowArrayForDuckDB mirrors on the per-batch ArrowArray,
    allocating new buffers and installing a wrapping release.

Wired at every Lance -> DuckDB boundary: scan / exec-IR / KNN / FTS binds,
batch loads, deferred-take, MERGE take, catalog schema generators.

Write-path guard: coercion is destructive on the write side — DuckDB would
round-trip FLOAT back to Lance and silently widen the HALF storage. The
catalog entry now carries the list of coerced columns; INSERT / UPDATE /
MERGE throw NotImplementedException naming the affected columns.

Adds sqllogic test covering scalar / FixedSizeList / variable List /
Struct-with-HALF / List-of-Struct-with-HALF / Map-to-HALF, plus the
write guards, and a pylance fixture generator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three CI fixes for PR lance-format#195:

  * make format-fix — clang-format 11 on the three new C++ files and
    black==24 on the fixture generator.
  * Commit the generated test/data/float16_fixture.lance so the sqllogic
    test has an ATTACH target in CI.
  * Drop the Map<Utf8, Float16> case: Lance's Rust writer currently
    errors with "not yet implemented: Implement encoding for field
    half_map (halffloat)" so the shape cannot be round-tripped through a
    fixture. Schema-side Map<_, Float16> coercion is still exercised in
    the lance-spark LanceArrowUtilsSuite on the JVM side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DuckDB's `ClientContextState::registered_state` is an `unordered_map`, so
the iteration order at profile-time is hash-dependent: MSVC STL prints
"Lance Dataset Cache" before "Lance Session Cache", while libstdc++ and
libc++ print them the other way. The prior regex locked in one order and
blew up on Windows.

Accept either ordering via alternation. Note: PR lance-format#188 (which introduced
this test) was never exercised on Windows because the distribution
pipeline only triggers on CMakeLists.txt / vcpkg.json changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
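
The order-insensitive match described in the commit above can be illustrated with `std::regex`. This is only a sketch of the alternation technique: the actual test is a sqllogic test file whose regex syntax is not shown here, and `MentionsBothCaches` is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Returns true when both cache names appear, in either order.
// Each alternative pins one ordering; together they cover both.
static bool MentionsBothCaches(const std::string &profile_output) {
  static const std::regex either_order(
      "Lance Dataset Cache[\\s\\S]*Lance Session Cache|"
      "Lance Session Cache[\\s\\S]*Lance Dataset Cache");
  return std::regex_search(profile_output, either_order);
}
```

Here `[\s\S]*` is used instead of `.*` so the match can span line breaks in the profiler output.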
@jiaoew1991 jiaoew1991 requested a review from Xuanwo April 21, 2026 17:03
Upstream Lance panics in conflict_resolver.rs:1659 when a concurrently-
VACUUM'd manifest is needed by a peer commit — repro'd only on Windows
runners. Gate the test with `require notwindows` until the upstream
race is fixed. Linux/macOS coverage is unchanged.

This test has never been exercised on Windows before because the
Distribution workflow only triggers on CMakeLists.txt / vcpkg.json
changes, and PRs lance-format#188/lance-format#189 (which added this test) didn't touch either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@Xuanwo Xuanwo left a comment

Collaborator

LanceTableEntry::Copy() should preserve coerced_column_names (see src/lance_scan.cpp, around the copied entry construction). The new INSERT/UPDATE/MERGE guards rely on HasCoercedColumns(), so a copied catalog entry can expose the widened FLOAT schema while losing the write-protection state. Please copy the coerced column list into the returned entry.
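
The requested fix amounts to carrying one more field through the copy. A heavily simplified sketch, using a hypothetical `TableEntrySketch` in place of the real `LanceTableEntry` (whose other members are not shown in this PR page):

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the catalog entry; only the relevant state.
struct TableEntrySketch {
  std::string name;
  std::vector<std::string> coerced_column_names;

  // The write guards consult this before planning INSERT/UPDATE/MERGE.
  bool HasCoercedColumns() const { return !coerced_column_names.empty(); }

  std::unique_ptr<TableEntrySketch> Copy() const {
    std::unique_ptr<TableEntrySketch> result(new TableEntrySketch());
    result->name = name;
    // Without this line the copy would report HasCoercedColumns() == false
    // and the write guards would no longer fire.
    result->coerced_column_names = coerced_column_names;
    return result;
  }
};
```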

jiaoew1991 and others added 2 commits May 7, 2026 09:32
Without this, a copied catalog entry exposes the widened FLOAT schema
but HasCoercedColumns() returns false, silently dropping the
INSERT/UPDATE/MERGE write protection and risking on-disk corruption.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jiaoew1991

Contributor Author

@Xuanwo I have made the changes as requested, pls take a look again.

@Xuanwo Xuanwo merged commit 4fae532 into lance-format:main May 8, 2026
11 checks passed