Skip to content

Conversation

@Xuanwo
Copy link
Collaborator

@Xuanwo Xuanwo commented Nov 14, 2025

This PR is based on #5210


This PR intends to add a blob arrow extension type (aka, logical type) in lance.

This PR was primarily authored with Codex using GPT-5-Codex and then hand-reviewed by me. I AM responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off, I'll fix it quickly.

@github-actions github-actions bot added the enhancement New feature or request label Nov 14, 2025
Signed-off-by: Xuanwo <[email protected]>
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 14 to 26
use lance_core::{utils::address::RowAddress, Error, Result};
use lance_io::traits::Reader;

pub const BLOB_VERSION_CONFIG_KEY: &str = "lance.blob.version";

pub fn blob_version_from_config(config: &HashMap<String, String>) -> BlobVersion {
config
.get(BLOB_VERSION_CONFIG_KEY)
.and_then(|value| BlobVersion::from_config_value(value))
.unwrap_or(BlobVersion::V1)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Respect legacy blob version metadata when config is absent

Blob version detection now relies exclusively on the dataset config (blob_version_from_config) and BlobVersion::from_config_value, while the previous code read the lance-encoding:blob-version field metadata. Existing tables written with v2 blobs before this change store the version only in field metadata and have no config entry. After this commit such tables will always default to BlobVersion::V1, so blob descriptors are parsed with the wrong layout and reads will fail or return incorrect data. Consider falling back to the per-field metadata key when the config value is missing so older manifests remain readable.

Useful? React with 👍 / 👎.

Comment on lines 1030 to 1037
self
}

pub fn with_blob_version(mut self, blob_version: BlobVersion) -> Self {
self.blob_version = blob_version;
self
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate dataset blob version to all projections

The new Projection field blob_version defaults to BlobVersion::V1 and is only set when callers explicitly invoke with_blob_version. Many existing call sites still construct projections directly via Projection::full(...) or Projection::empty(...) (e.g. the scanner and DataFusion code) and will continue to operate with the default. On tables using the new blob v2 layout those projections will still expand blobs using the v1 descriptor schema, leading to mismatched field shapes and runtime errors. Either derive the blob version from the Projectable passed to Projection::full/empty or update all constructors to set the dataset’s blob version by default.

Useful? React with 👍 / 👎.

@codecov-commenter
Copy link

codecov-commenter commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 96.61017% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.24%. Comparing base (27d8ac5) to head (00b67f4).

Files with missing lines Patch % Lines
rust/lance-arrow/src/schema.rs 75.00% 0 Missing and 1 partial ⚠️
rust/lance-core/src/datatypes/field.rs 98.03% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5239      +/-   ##
==========================================
+ Coverage   82.21%   82.24%   +0.03%     
==========================================
  Files         344      344              
  Lines      144879   144937      +58     
  Branches   144879   144937      +58     
==========================================
+ Hits       119108   119202      +94     
+ Misses      21836    21798      -38     
- Partials     3935     3937       +2     
Flag Coverage Δ
unittests 82.24% <96.61%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Xuanwo Xuanwo merged commit d56718d into main Nov 19, 2025
25 checks passed
@Xuanwo Xuanwo deleted the blob-v2-external branch November 19, 2025 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants