Docs: add lancedb integration page #9594
base: master
📚 Documentation preview at https://pr-9594.docs-lakefs-preview.io/
lgtm; added minor comments.
```python

db = lancedb.connect(
```
Adding the installation requirements and the import statement would help.
### Multimodal data storage

In many cases, lanceDB stores embeddings of data that exists elsewhere: documents, images, text files etc. These "raw" data files are processed to extract embeddings, which are then stored in lanceDB - but they are also stored in their raw form for retrieval, and in some cases metadata about them is stored in other formats for use in a data warehouse or data lake.
Suggested change:
In many cases, LanceDB stores embeddings of data that exists elsewhere: documents, images, text files etc. These "raw" data files are processed to extract embeddings, which are then stored in LanceDB - but they are also stored in their raw form for retrieval, and in some cases metadata about them is stored in other formats for use in a data warehouse or data lake.
lakeFS provides a highly performant and scalable way to understand how data changes over time. For example, say we store raw image data in `images/` - we can update the raw data by adding, removing or updating images in the `images/` directory - and lakeFS will capture the changes as a commit.

This allows you to perform differential processing of new data: If we have an `images` table in lanceDB we can keep track of the latest commit represented in that table. As new data arrives, we can update our embeddings with the latest commit by diffing the previous commit and the new one, resulting in a minimal set of embeddings to add, remove or update:
Suggested change:
This allows you to perform differential processing of new data: If we have an `images` table in LanceDB we can keep track of the latest commit represented in that table. As new data arrives, we can update our embeddings with the latest commit by diffing the previous commit and the new one, resulting in a minimal set of embeddings to add, remove or update:
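To make the differential-processing idea above concrete, here is a minimal sketch of turning a commit-to-commit diff into a work list. It is not the page's own code. The `Change` tuple and `plan_embedding_updates` are hypothetical names used for illustration; in practice the entries would come from a lakeFS diff between the previously processed commit and the new one, and the assumption is that each entry carries a change type and an object path.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Change(NamedTuple):
    """Hypothetical stand-in for one entry of a lakeFS diff."""
    type: str  # assumed values: "added", "changed" or "removed"
    path: str  # object path, e.g. "images/cat.png"


def plan_embedding_updates(
    changes: Iterable[Change], prefix: str = "images/"
) -> Tuple[List[str], List[str]]:
    """Split a diff into paths to (re-)embed and paths whose old rows should be deleted."""
    to_embed: List[str] = []
    to_delete: List[str] = []
    for change in changes:
        if not change.path.startswith(prefix):
            continue  # ignore objects outside the raw-data directory
        if change.type in ("added", "changed"):
            to_embed.append(change.path)  # needs a fresh embedding
        if change.type in ("removed", "changed"):
            to_delete.append(change.path)  # the existing row is stale
    return to_embed, to_delete


# Example diff between the last processed commit and the new one.
diff = [
    Change("added", "images/a.png"),
    Change("changed", "images/b.png"),
    Change("removed", "images/c.png"),
    Change("added", "docs/readme.md"),
]
to_embed, to_delete = plan_embedding_updates(diff)
```

A changed object appears in both lists, so its stale row is dropped before the new embedding is written. After applying the plan, the table would record the new commit ID as its latest processed commit.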
    uri="s3://example-repo/example-branch/path/to/lancedb",
    storage_options={
        "endpoint": "https://example.lakefs.io",
Add comments on these lines explaining how the `s3://` URI maps to the lakeFS repository, and what the lakeFS endpoint is.
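One possible shape for the commented snippet, also covering the earlier request for installation and import, is sketched below. The helper names `lakefs_connect_args` and `connect_to_lakefs` are hypothetical, and the credential keys `aws_access_key_id` and `aws_secret_access_key` are assumptions that do not appear in the hunk; only `uri`, `storage_options` and `endpoint` come from the diff.

```python
# Requires the LanceDB client (assumed install command: pip install lancedb).


def lakefs_connect_args(repo, branch, path, lakefs_endpoint,
                        access_key_id, secret_access_key):
    """Build keyword arguments for lancedb.connect() that target lakeFS."""
    return {
        # s3://<repository>/<branch>/<path>: through the lakeFS S3 gateway the
        # repository plays the role of the bucket and the branch is the first
        # element of the object key.
        "uri": f"s3://{repo}/{branch}/{path}",
        "storage_options": {
            # URL of the lakeFS server, so requests go to lakeFS, not AWS S3.
            "endpoint": lakefs_endpoint,
            # lakeFS credentials (key names are an assumption, see above).
            "aws_access_key_id": access_key_id,
            "aws_secret_access_key": secret_access_key,
        },
    }


def connect_to_lakefs(**kwargs):
    """Open a LanceDB connection on top of lakeFS."""
    import lancedb  # imported lazily so the helper above works without it

    return lancedb.connect(**lakefs_connect_args(**kwargs))


args = lakefs_connect_args(
    repo="example-repo",
    branch="example-branch",
    path="path/to/lancedb",
    lakefs_endpoint="https://example.lakefs.io",
    access_key_id="<lakeFS access key ID>",
    secret_access_key="<lakeFS secret access key>",
)
```

Keeping argument construction separate from the `lancedb.connect` call means the URI layout can be checked without a running lakeFS server.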
# Using lakeFS with LanceDB

[LanceDB](https://lancedb.com/) is a vector database that allows you to store and query vector data.
Consider adding another line about LanceDB's use case: storing and querying embeddings, often used in AI/ML pipelines, or something like that.
Add a documentation page describing the use cases for LanceDB on top of lakeFS, with a how-to guide.