Docs: add lancedb integration page #9594
Merged
+92
−0
2 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,90 @@ | ||
| --- | ||
| title: LanceDB | ||
| description: This section explains how you can start using lakeFS with LanceDB. | ||
| status: new | ||
| --- | ||
|
|
||
| # Using lakeFS with LanceDB | ||
|
|
||
| [LanceDB](https://lancedb.com/) is a vector database that allows you to store and query vector data. | ||
|
|
||
| LanceDB works directly on an object store, so you can use it to store and query vector data in lakeFS. | ||
|
|
||
| ## Configuring LanceDB to work with lakeFS | ||
|
|
||
| To configure LanceDB to work with lakeFS, configure it to use the lakeFS [S3 Gateway](../understand/architecture.md#s3-gateway): | ||
|
|
||
|
|
||
| ```python | ||
| import lancedb # pip install lancedb | ||
|
|
||
| db = lancedb.connect( | ||
| # structure: s3://<repository ID>/<branch>/<path> | ||
| uri="s3://example-repo/example-branch/path/to/lancedb", | ||
| storage_options={ | ||
| # Your lakeFS S3 Gateway | ||
| "endpoint": "https://example.lakefs.io", | ||
| # Access key and secret of a lakeFS user with permissions | ||
| # to read and write data from that path | ||
| "access_key_id": "AKIAIOSFODNN7EXAMPLE", | ||
| "secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", | ||
| } | ||
| ) | ||
|
|
||
| # table "vectors" on the branch "example-branch" in the repository "example-repo" | ||
| table = db.open_table('vectors') | ||
|
|
||
| # update and query some data! | ||
| # generate_embedding() is a placeholder for your own embedding function | ||
| table.add([{'id': '1', 'embedding': generate_embedding('some data')}]) | ||
| df = table.search(generate_embedding('some other data')).limit(10).to_pandas() | ||
| ``` | ||
| !!! tip | ||
| For more information on configuring and using LanceDB, see the [LanceDB documentation](https://lancedb.com/docs/storage/integrations/). | ||
|
|
||
|
|
||
| ## Use Cases | ||
|
|
||
| Running LanceDB on top of lakeFS has a few major benefits: | ||
|
|
||
| ### Multimodal data storage | ||
|
|
||
| In many cases, LanceDB stores embeddings of data that exists elsewhere: documents, images, text files, and so on. These "raw" data files are processed to extract embeddings, which are then stored in LanceDB. The files are also kept in their raw form for retrieval, and in some cases metadata about them is stored in other formats for use in a data warehouse or data lake. | ||
|
|
||
| By co-locating these embeddings with the other modalities, you can perform more complex queries and analysis without giving up consistency: a single commit captures the vector embeddings, raw data, and metadata as one atomic unit. | ||
|
|
||
|  | ||
|
|
||
|
|
||
| ### Differential processing of new data | ||
|
|
||
| lakeFS provides a highly performant and scalable way to understand how data changes over time. For example, say we store raw image data in `images/`. We can update the raw data by adding, removing or updating images in the `images/` directory, and lakeFS will capture the changes as a commit. | ||
|
|
||
| This allows you to perform differential processing of new data: If we have an `images` table in LanceDB we can keep track of the latest commit represented in that table. As new data arrives, we can update our embeddings with the latest commit by diffing the previous commit and the new one, resulting in a minimal set of embeddings to add, remove or update: | ||
|
|
||
|
|
||
|  | ||
|
|
||
|
|
||
| ### Ensuring high quality data with Write, Audit, Publish hooks | ||
|
|
||
| Using [lakeFS hooks](../howto/hooks/index.md), you can ensure the vector embeddings meet a certain threshold for quality before they are made available for inference. These quality checks can be triggered automatically before new data is merged into a `main` or `production` branch. These tests could be: | ||
|
|
||
| * **Coverage:** How many of the images in the dataset have been processed and have embeddings? How many embeddings point to images no longer in the dataset? | ||
| * **Governance:** Are the embeddings consistent with the data policy? Do they contain any PII? | ||
| * **Accuracy:** Are the embeddings accurate? Do they match the expected output? | ||
| * **Drift:** Are metrics like centroid shift and norm drift within acceptable limits? | ||
|
|
||
| If any of these tests fail, the merge is rejected and the data is not made available for inference. This ensures that the data is of high quality and that it is consistent with the data policy. | ||
|
|
||
|  | ||
|
|
||
| ### Traceable & Reproducible inference | ||
|
|
||
| Once LanceDB data is stored in lakeFS, every query against the vector embeddings must specify the branch, tag or commit ID of the data it reads. | ||
| Capturing the commit ID of the data you want to query allows you to reproduce the exact same results at any point in time. A common approach is to capture this commit ID in inference logs - allowing you to reproduce the exact same results as the user or agent that originally made the query. | ||
|
|
||
| While this sounds simple, vector databases often change quite frequently over time, making it hard to answer questions like "why did this customer complain about the chatbot being rude?" or "why did this product recommendation not work for this user?". | ||
|
|
||
| By tying that commit ID to the query, we can even go further and see the raw data as it existed at that point in time, complete with a commit log of who introduced that change, when and why. | ||
|
|
||
|  | ||
consider adding another line about the use-case of using LanceDB.
Used for storing and querying embeddings often used in AI/ML pipelines... something like that.