-
Notifications
You must be signed in to change notification settings - Fork 1.2k
fix: Agent uses the first configured vector_db_id when documents are provided #1276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems somewhat arbitrary; I think we should at least log that information. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with logging of course.
About the "arbitrary" part: what else could we do in this case? Some ideas that come to mind:
- add an explicit config arg to identify the ingestion vector_db?
- extend the concept of session vector_db to store a list of ids?
- stop the execution in case more dbs are given? (when documents are also provided)
- other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestions!
I don’t think the code should be making decisions on behalf of the user, so having a config arg to specify which DB to use in this scenario makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an important use case for multiple vector DBs is federating across those vector DBs. That's challenging to do well, and the approach to doing it is different when each vector DB has the same content and when they have different content. When they have the same content, you generally want to specify some sort of unique ID on each chunk and/or document that you can use to recognize when the same result came from two different sources so you can boost that result. All of that is out of scope for this PR of course, but it would be good to design the configuration for which DB to use in a way that reflects that in the future we might want the users to be able to select all and/or a subset and then provide additional configuration details about how to federate across them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jwm4 👍 for that is out of scope for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leseb I've added a new field insert_vector_db_id
to configure the ingestion DB but I'm wondering how we can document this change. I see that the other argument vector_db_ids
has not been well documented either, but just mentioned in code snippets.
I think we need to define the behaviour for when documents are provided. WDYT? 1/ If vector_db_id is not provided. We do not perform any indexing and sends the raw document content. Reference discussion in #1118 (comment) cc @hardikjshah |
If we want to provide an option to send the whole document content in the context, what about adding a new builtin vector-io provider that implements this logic? (in a separate issue/PR) This would also reuse the existing PDF parsing logic.
+1
I'd use and explicit provider for that, instead (see point 1). =====
Sample specification:
Notes:
The API would be more clear if we could change the definition of the documents field to hold both a vector_db_id and a list of Document. Unfortunately this option may generate some concerns because ATM the DBs are passed to the agent config but the documents are defined in the create turn API.
|
@leseb @yanxi0830 I prepared a different version with the changes described in the previous comment:
Let me know if you prefer me to submit it instead. |
It feels odd to me to insert documents into the provided vector db from the RAG tool, since the it is an ad-hoc document only for the current thread, and unlikely something that the user expects to persist in the persistent vector db. How about this:
|
@ehhuang , I don't think I understand your comment. When you say this:
What do you mean by "the existing API"? Are you referring to the API for that vector DB, or some Llama Stack API and if the latter than which API is that? FWIW, I do think the following snippet from the Quick Start guide is a little odd: # Register a vector database
vector_db_id = f"test-vector-db-{uuid.uuid4().hex}"
client.vector_dbs.register(
vector_db_id=vector_db_id,
provider_id=provider_id,
embedding_model="all-MiniLM-L6-v2",
embedding_dimension=384,
)
# Insert the documents into the vector database
client.tool_runtime.rag_tool.insert(
documents=documents,
vector_db_id=vector_db_id,
chunk_size_in_tokens=512,
) Specifically, I would have expected that instead of a |
Yea sorry I was referring to
|
Hey, thanks for your input! Note that the initial problem tracked by the associated issue is that the current implementation cannot create a default db for the session when multiple providers are configured: let’s find a solution that does not cause the same issue again while tempting to change the behavior 😉 |
Updated the PR to:
toolgroups=[
{
"name": "builtin::rag",
"args": {
"vector_db_ids": [_DB_IDS_FOR_QUERY_PURPOSES_], # Optional
"documents_db_id": _DB_ID_FOR_INGESTION_PURPOSES_ # If provided, it's also used at query time
},
}
],
|
Sorry I read into this issue and code in more detail (am pretty new to this project too). I realized that what I suggested above was already the current behavior, except that it broke (i.e. need to specify the provider_id). Can we just choose a provider_id from available ones arbitrarily? Alternatively, we choose one provider_id and throw an error if it's not available and when documents are provided. Re. documents_db_id, thanks for putting up the solution. My concern here is the added complexity. My understanding of the point of the With |
Being "arbitrary" was the first comment I received, which then started the journey about trying to review the API and its behavior, my fault. If we want to change the API and behavior, we'll track and discuss it with a separate issue.
Agree: Let's move on with the simpler solution then. It just have to be clear that this is a one-time consumption of these documents. |
Signed-off-by: Daniele Martinoli <[email protected]>
…from session info Signed-off-by: Daniele Martinoli <[email protected]>
Signed-off-by: Daniele Martinoli <[email protected]>
Signed-off-by: Daniele Martinoli <[email protected]>
3e7b351
to
1181754
Compare
Signed-off-by: Daniele Martinoli <[email protected]>
…r an ephemeral vector db Signed-off-by: Daniele Martinoli <[email protected]>
|
LG. Thanks! @dmartinol |
can we at least merge the changes to the broken UT? |
What does this PR do?
The agent API allows to query multiple DBs using the
vector_db_ids
argument of therag
tool:This means that multiple DBs can be used to compose an aggregated context by executing the query on each of them.
When documents are passed to the next agent turn, there is no explicit way to configure the vector DB where the embeddings will be ingested. In such cases, we can assume that:
vector_db_ids
is given, we use the first one (it probably makes sense to assume that it's the only one in the list, otherwise we should loop on all the given DBs to have a consistent ingestion)vector_db_ids
is given, we can use the current logic to generate a default DB using the default provider. If multiple providers are defined, the API will fail as expected: the user has to provide details on where to ingest the documents.(Closes #1270)
Test Plan
The issue description details how to replicate the problem.