feat: add huggingface native support #5353

Xuanwo · 2025-11-26T13:27:41Z

This PR adds native Hugging Face support to Lance.

This PR was primarily authored with Codex using GPT-5-Codex and then hand-reviewed by me. I AM responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off and I'll fix it quickly.

Signed-off-by: Xuanwo <[email protected]>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

rust/lance-io/src/object_store/providers/huggingface.rs

pavanramkumar · 2025-11-26T17:24:36Z

thanks for this! looks like some tests have been queued for the past 3 hours or so, is that expected?

jackye1995

looks good to me!

jackye1995 · 2025-11-26T17:52:55Z

> thanks for this! looks like some tests have been queued for the past 3 hours or so, is that expected?

Yes, there are some issues with GitHub runners right now; we are working on switching to third-party runners.

pavanramkumar · 2025-11-26T17:59:25Z

were you already able to test with this public hf dataset @Xuanwo?

>>> hf_path = "hf://datasets/pavan-ramkumar/test-slaf/tree/main/synthetic_50k_processed_v21.slaf/expression.lance"
>>> ds = lance.dataset(hf_path)

jackye1995 · 2025-11-27T00:49:44Z

@Xuanwo can you rebase onto main? The CI should work now.

Xuanwo · 2025-11-27T09:30:52Z

Hi, @pavanramkumar, yes, it works!

However, we didn't support tree/main in the URI since we have dedicated support for revision, which should be passed through storage options. Is it expected?

codecov · 2025-11-27T12:21:37Z

Codecov Report

❌ Patch coverage is 84.78261% with 28 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...lance-io/src/object_store/providers/huggingface.rs	82.64%	14 Missing and 7 partials ⚠️
rust/lance-io/src/object_store.rs	84.37%	4 Missing and 1 partial ⚠️
rust/lance-io/src/object_store/providers.rs	83.33%	0 Missing and 2 partials ⚠️

pavanramkumar · 2025-11-27T16:04:50Z

Fantastic! I just tested this now and it works. Thank you!

> However, we didn't support tree/main in the URI since we have dedicated support for revision, which should be passed through storage options. Is it expected?

Specifying a revision also works nicely!

>>> ds = lance.dataset(hf_path, storage_options={'revision': 'tree/main'})
>>> ds.sample(1)
pyarrow.Table
cell_integer_id: int32
gene_integer_id: int32
value: float
----
cell_integer_id: [[16445]]
gene_integer_id: [[20436]]
value: [[2.2937665]]

It's not urgent, but I don't see revision as a potential argument to storage_options in the docs. cc @prrao87 @Xuanwo
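
For readers who start from a Hub-style path that embeds `tree/<revision>`, here is a minimal sketch of moving that segment into storage options, as suggested above. The helper name `split_hf_revision` is hypothetical and not part of Lance. The URI layout it expects (`<repo_type>/<owner>/<repo>/tree/<revision>/<path>`) and the restriction to revisions without slashes are assumptions. Whether Lance accepts the rewritten URI is not confirmed in this thread; only the `storage_options={'revision': ...}` call shape is.

```python
from typing import Dict, Tuple

HF_SCHEME = "hf://"


def split_hf_revision(uri: str) -> Tuple[str, Dict[str, str]]:
    """Hypothetical helper: move a `tree/<revision>` segment into storage options.

    Assumed layout: hf://<repo_type>/<owner>/<repo>/tree/<revision>/<path...>
    Revisions containing "/" are not handled by this sketch.
    """
    if not uri.startswith(HF_SCHEME):
        raise ValueError(f"not an hf:// URI: {uri!r}")

    parts = uri[len(HF_SCHEME):].split("/")

    # parts[0:3] is <repo_type>/<owner>/<repo>; "tree" must follow directly.
    if len(parts) >= 5 and parts[3] == "tree":
        revision = parts[4]
        remaining = parts[:3] + parts[5:]
        return HF_SCHEME + "/".join(remaining), {"revision": revision}

    # No revision segment found: return the URI unchanged with no options.
    return uri, {}


# Usage (call shape taken from the thread; requires lance to be installed):
# uri, opts = split_hf_revision(hf_path)
# ds = lance.dataset(uri, storage_options=opts)
```

Keeping the revision out of the path leaves a single place to change when pinning a dataset to a branch or commit.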

feat: add huggingface native support

5fade70

Signed-off-by: Xuanwo <[email protected]>

github-actions bot added the enhancement label (New feature or request) Nov 26, 2025

chatgpt-codex-connector bot reviewed Nov 26, 2025

rust/lance-io/src/object_store/providers/huggingface.rs

Fix test

c1d29ab

Signed-off-by: Xuanwo <[email protected]>

prrao87 mentioned this pull request Nov 26, 2025

Support hosting lance / vortex / iceberg / zarr datasets on huggingface hub huggingface/datasets#7863

Open

jackye1995 approved these changes Nov 26, 2025

Xuanwo added 3 commits November 27, 2025 17:11

Merge branch 'main' into Xuanwo/hf-fragment-control

d33cc99

Fix tests

9bdd5f4

Signed-off-by: Xuanwo <[email protected]>

Merge remote-tracking branch 'refs/remotes/origin/Xuanwo/hf-fragment-control' into Xuanwo/hf-fragment-control

6c20a8b

Xuanwo mentioned this pull request Nov 27, 2025

fix(services/huggingface): Allow users to use datasets as an alias to dataset repo type apache/opendal#6826

Merged

Fix tests

63543f7

Signed-off-by: Xuanwo <[email protected]>

Xuanwo merged commit 0204e7e into main Nov 27, 2025
24 of 25 checks passed

Xuanwo deleted the Xuanwo/hf-fragment-control branch November 27, 2025 12:20

prrao87 mentioned this pull request Nov 27, 2025

doc: Specify revision argument for object store configuration #5365

Open
