ozkatz
Collaborator

@ozkatz ozkatz commented Oct 21, 2025

Add a documentation page covering the use cases for LanceDB on top of lakeFS, along with a how-to guide.

@ozkatz ozkatz requested review from nopcoder and talSofer October 21, 2025 20:42
@ozkatz ozkatz self-assigned this Oct 21, 2025
@ozkatz ozkatz added the docs, exclude-changelog, and minor-change labels Oct 21, 2025
@github-actions
📚 Documentation preview at https://pr-9594.docs-lakefs-preview.io/

Contributor

@nopcoder nopcoder left a comment

lgtm; added minor comments.

Comment on lines +18 to +20
```python

db = lancedb.connect(
```
Contributor

Adding the installation requirements and the import statement will help.


### Multimodal data storage

In many cases, lanceDB stores embeddings of data that exists elsewhere: documents, images, text files etc. These "raw" data files are processed to extract embeddings, which are then stored in lanceDB - but they are also stored in their raw form for retrieval, and in some cases metadata about them is stored in other formats for use in a data warehouse or data lake.
Contributor

Suggested change
In many cases, lanceDB stores embeddings of data that exists elsewhere: documents, images, text files etc. These "raw" data files are processed to extract embeddings, which are then stored in lanceDB - but they are also stored in their raw form for retrieval, and in some cases metadata about them is stored in other formats for use in a data warehouse or data lake.
In many cases, LanceDB stores embeddings of data that exists elsewhere: documents, images, text files etc. These "raw" data files are processed to extract embeddings, which are then stored in LanceDB - but they are also stored in their raw form for retrieval, and in some cases metadata about them is stored in other formats for use in a data warehouse or data lake.


lakeFS provides a highly performant and scalable way to understand how data changes over time. For example, say we store raw image data in `images/` - we can update the raw data by adding, removing or updating images in the `images/` directory - and lakeFS will capture the changes as a commit.

This allows you to perform differential processing of new data: If we have an `images` table in lanceDB we can keep track of the latest commit represented in that table. As new data arrives, we can update our embeddings with the latest commit by diffing the previous commit and the new one, resulting in a minimal set of embeddings to add, remove or update:
Contributor

Suggested change
This allows you to perform differential processing of new data: If we have an `images` table in lanceDB we can keep track of the latest commit represented in that table. As new data arrives, we can update our embeddings with the latest commit by diffing the previous commit and the new one, resulting in a minimal set of embeddings to add, remove or update:
This allows you to perform differential processing of new data: If we have an `images` table in LanceDB we can keep track of the latest commit represented in that table. As new data arrives, we can update our embeddings with the latest commit by diffing the previous commit and the new one, resulting in a minimal set of embeddings to add, remove or update:
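
A rough sketch of the differential step this paragraph describes, assuming the diff between the previous commit and the new one is available as a sequence of (type, path) entries. The `Change` tuple and `plan_embedding_updates` are hypothetical names used for illustration, and the type values `"added"`, `"changed"` and `"removed"` are an assumption about how the diff is reported:

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Change(NamedTuple):
    """Hypothetical stand-in for one entry of a diff between two lakeFS commits."""
    type: str  # assumed values: "added", "changed", "removed"
    path: str  # object path inside the repository


def plan_embedding_updates(
    changes: Iterable[Change], prefix: str = "images/"
) -> Tuple[Set[str], Set[str]]:
    """Split a commit diff into paths to (re-)embed and paths whose embeddings to drop."""
    to_embed: Set[str] = set()
    to_delete: Set[str] = set()
    for change in changes:
        # Ignore changes outside the raw-data directory we index.
        if not change.path.startswith(prefix):
            continue
        if change.type in ("added", "changed"):
            to_embed.add(change.path)
        elif change.type == "removed":
            to_delete.add(change.path)
    return to_embed, to_delete
```

With the lakeFS Python SDK, the entries would presumably come from diffing the commit recorded for the `images` table against the new commit; the exact call is left out here because it is not shown in this thread. After applying both sets to the table, the new commit ID would be stored as the table's latest processed commit.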

Comment on lines +21 to +23
uri="s3://example-repo/example-branch/path/to/lancedb",
storage_options={
"endpoint": "https://example.lakefs.io",
Contributor

Add comments to these lines explaining how the S3 URI maps to the lakeFS repository, and that the endpoint is the lakeFS endpoint.
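
One possible shape for the commented snippet, also covering the install and import lines requested in the earlier comment. This is a sketch under assumptions, not the text of the PR: `lakefs_connect_args` and `connect` are hypothetical helpers, and the credential and `virtual_hosted_style_request` option names are assumed from the S3 options LanceDB's object store layer accepts. Only `uri`, `storage_options` and `endpoint` appear in the diff itself.

```python
# Install the client first (shell):
#   pip install lancedb


def lakefs_connect_args(repo, ref, path, lakefs_endpoint, access_key_id, secret_access_key):
    """Build keyword arguments for lancedb.connect() that point at a lakeFS S3 gateway."""
    return {
        # Through the lakeFS S3 gateway, the repository plays the role of the
        # bucket and the branch (or other ref) is the first segment of the key:
        #   s3://<repository>/<ref>/<path inside the ref>
        "uri": f"s3://{repo}/{ref}/{path.strip('/')}",
        "storage_options": {
            # URL of the lakeFS server, not of AWS S3 or the backing object store.
            "endpoint": lakefs_endpoint,
            # lakeFS credentials (assumed option names).
            "access_key_id": access_key_id,
            "secret_access_key": secret_access_key,
            # Assumption: use path-style addressing, so the repository name
            # stays in the URL path rather than the hostname.
            "virtual_hosted_style_request": "false",
        },
    }


def connect(**kwargs):
    """Open the LanceDB database stored under the given lakeFS repository and ref."""
    import lancedb  # imported lazily so the helper above works without the package

    return lancedb.connect(**lakefs_connect_args(**kwargs))
```

Keeping the URI construction in one place makes it easy to switch branches: changing `ref` from `example-branch` to another branch or a commit ID points LanceDB at a different version of the same tables.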


# Using lakeFS with LanceDB

[LanceDB](https://lancedb.com/) is a vector database that allows you to store and query vector data.
Contributor

Consider adding another line about the use case for LanceDB.
It is used for storing and querying embeddings, often in AI/ML pipelines... something like that.
