
Conversation

dmartinol
Contributor

What does this PR do?

The agent API allows querying multiple DBs using the vector_db_ids argument of the rag tool:

        toolgroups=[
            {
                "name": "builtin::rag",
                "args": {"vector_db_ids": [vector_db_id]},
            }
        ],

This means that multiple DBs can be used to compose an aggregated context by executing the query on each of them.

When documents are passed to the next agent turn, there is no explicit way to configure the vector DB where the embeddings will be ingested. In such cases, we can assume that:

  • if any vector_db_ids are given, we use the first one (it probably makes sense to assume that it's the only one in the list, otherwise we should loop over all the given DBs to have a consistent ingestion)
  • if no vector_db_ids are given, we can use the current logic to generate a default DB using the default provider. If multiple providers are defined, the API will fail as expected: the user has to provide details on where to ingest the documents.
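
The two rules above can be sketched as a small selection function. This is only an illustration, not the code in this PR: pick_ingestion_db and the create_default_db callback are hypothetical names standing in for the agent's existing default-DB logic.

```python
from typing import Callable, Sequence


def pick_ingestion_db(
    vector_db_ids: Sequence[str],
    create_default_db: Callable[[], str],
) -> str:
    """Return the ID of the vector DB that should receive the turn's documents."""
    if vector_db_ids:
        # Rule 1: reuse the first DB listed for the RAG tool.
        return vector_db_ids[0]
    # Rule 2: fall back to the existing default-DB logic (hypothetical callback),
    # which is expected to fail when several providers are configured.
    return create_default_db()
```

With this shape, the "arbitrary" part of the choice is confined to the first branch, which is also a natural place to log which DB was selected.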

(Closes #1270)

Test Plan

The issue description details how to replicate the problem.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 26, 2025
@dmartinol dmartinol marked this pull request as ready for review February 26, 2025 11:21
Collaborator

This seems somewhat arbitrary; I think we should at least log that information. What do you think?

Contributor Author

Agree with logging of course.
About the "arbitrary" part: what else could we do in this case? Some ideas that come to mind:

  • add an explicit config arg to identify the ingestion vector_db?
  • extend the concept of session vector_db to store a list of ids?
  • stop the execution in case more dbs are given? (when documents are also provided)
  • other?

Collaborator

Thanks for the suggestions!

I don’t think the code should be making decisions on behalf of the user, so having a config arg to specify which DB to use in this scenario makes sense to me.

Contributor

I think an important use case for multiple vector DBs is federating across those vector DBs. That's challenging to do well, and the approach to doing it is different when each vector DB has the same content and when they have different content. When they have the same content, you generally want to specify some sort of unique ID on each chunk and/or document that you can use to recognize when the same result came from two different sources so you can boost that result. All of that is out of scope for this PR of course, but it would be good to design the configuration for which DB to use in a way that reflects that in the future we might want the users to be able to select all and/or a subset and then provide additional configuration details about how to federate across them.
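
To make the "same content" case concrete, here is a minimal sketch of merging hits from several vector DBs by chunk ID and boosting chunks that more than one DB returned. Everything in it is an assumption for illustration: the federate function, the (chunk_id, score) hit shape, and the additive boost are not part of Llama Stack or of this PR.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed hit shape: (chunk_id, score), where a higher score is better.
Hit = Tuple[str, float]


def federate(results_per_db: Dict[str, List[Hit]], boost: float = 0.1) -> List[Hit]:
    """Merge hits from several DBs, boosting chunks returned by more than one."""
    best: Dict[str, float] = {}
    sources: Dict[str, Set[str]] = defaultdict(set)
    for db_id, hits in results_per_db.items():
        for chunk_id, score in hits:
            # Keep the best score seen for this chunk and remember where it came from.
            best[chunk_id] = max(score, best.get(chunk_id, score))
            sources[chunk_id].add(db_id)
    merged = [
        # Add one boost step per additional DB that returned the same chunk.
        (chunk_id, score + boost * (len(sources[chunk_id]) - 1))
        for chunk_id, score in best.items()
    ]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)
```

This only works if every DB assigns the same ID to the same chunk, which is the reason for asking for a unique ID on each chunk or document.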

Contributor Author

@jwm4 👍 to this being out of scope for this PR

Contributor Author

@leseb I've added a new field insert_vector_db_id to configure the ingestion DB but I'm wondering how we can document this change. I see that the other argument vector_db_ids has not been well documented either, but just mentioned in code snippets.

@yanxi0830
Contributor

I think we need to define the behaviour for when documents are provided. WDYT?

1/ If vector_db_id is not provided, we do not perform any indexing and send the raw document content.
2/ If inserted_vector_db_id is provided for the document, we index the document into inserted_vector_db_id.
3/ When multiple vector_db_ids are provided in AgentConfig, but inserted_vector_db_id is not provided, we follow behaviour of (1)?

Reference discussion in #1118 (comment)

cc @hardikjshah

@dmartinol
Contributor Author

1/ If vector_db_id is not provided, we do not perform any indexing and send the raw document content.

If we want to provide an option to send the whole document content in the context, what about adding a new builtin vector-io provider that implements this logic? (in a separate issue/PR) This would also reuse the existing PDF parsing logic.

2/ If inserted_vector_db_id is provided for the document, we index the document into inserted_vector_db_id.

+1

3/ When multiple vector_db_ids are provided in AgentConfig, but inserted_vector_db_id is not provided, we follow behaviour of (1)?

I'd use an explicit provider for that instead (see point 1).

=====
What about reviewing the args of the builtin::rag tool as follows:

  • documents_db_id: the DB to store the given documents (also used to retrieve context)
  • vector_db_ids: additional DBs used only to query the context

Sample specification:

  • empty documents:
    • empty or given documents_db_id:
      • context is retrieved from vector_db_ids
  • given documents:
    • empty documents_db_id: raise ValueError
    • given documents_db_id:
      • use it to ingest the document chunks
      • context is retrieved from documents_db_id + vector_db_ids

This solution is similar to the latest commit, but removes some implicit behaviors.
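
The sample specification above could be expressed roughly as follows. It is a sketch under assumptions: resolve_rag_dbs is a hypothetical helper, documents is treated as an opaque sequence, and the duplicate filtering in the query list is my own addition.

```python
from typing import List, Optional, Sequence, Tuple


def resolve_rag_dbs(
    documents: Sequence[object],
    documents_db_id: Optional[str],
    vector_db_ids: Sequence[str],
) -> Tuple[Optional[str], List[str]]:
    """Return (ingestion_db_id, query_db_ids) for one turn."""
    if not documents:
        # Empty documents: nothing to ingest, context comes from vector_db_ids.
        return None, list(vector_db_ids)
    if not documents_db_id:
        # Given documents but no target DB: fail instead of guessing.
        raise ValueError("documents were provided but documents_db_id is not set")
    # Given documents and a target DB: ingest there and query it first,
    # followed by the additional DBs (skipping a duplicate of the target).
    query_ids = [documents_db_id] + [v for v in vector_db_ids if v != documents_db_id]
    return documents_db_id, query_ids
```

For example, resolve_rag_dbs(["doc"], "docs-db", ["kb-1"]) would ingest into "docs-db" and query "docs-db" and "kb-1".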

Notes:

  • this option would allow us to entirely remove the vector_db_id field from the session_info.
  • also, we can remove the logic to generate a default DB when none is given

The API would be clearer if we could change the definition of the documents field to hold both a vector_db_id and a list of Document. Unfortunately this option may raise some concerns because ATM the DBs are passed in the agent config but the documents are defined in the create turn API.

Reference discussion in #1118 (comment)
Thanks for linking to the other discussion!

@dmartinol
Contributor Author

@leseb @yanxi0830 I prepared a different version with the changes described in the previous comment:

documents_db_id: the DB to store the given documents (also used to retrieve context)
vector_db_ids: additional DBs used only to query the context

- this option would allow us to entirely remove the vector_db_id field from the session_info.
- also, we can remove the logic to generate a default DB when none is given

Let me know if you prefer me to submit it instead.
Note: as a side effect, both changes should also impact the example code (rag.md) in this repo and examples from the llama-stack-apps repo. I will take care of both

@ehhuang
Contributor

ehhuang commented Feb 27, 2025

It feels odd to me to insert documents into the provided vector db from the RAG tool, since it is an ad-hoc document only for the current thread, and unlikely something that the user expects to persist in the persistent vector db.

How about this:

  1. Documents are added to an ephemeral thread-level vector DB (with some TTL), which we create automatically when documents are present.
  2. If user wants to add documents to their persistent vector DB, they can also do so explicitly by calling the existing API.

@jwm4
Contributor

jwm4 commented Feb 27, 2025

@ehhuang , I don't think I understand your comment. When you say this:

If user wants to add documents to their persistent vector DB, they can also do so explicitly by calling the existing API.

What do you mean by "the existing API"? Are you referring to the API for that vector DB, or some Llama Stack API? If the latter, then which API is that?

FWIW, I do think the following snippet from the Quick Start guide is a little odd:

# Register a vector database
vector_db_id = f"test-vector-db-{uuid.uuid4().hex}"
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    provider_id=provider_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
)

# Insert the documents into the vector database
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)

Specifically, I would have expected that instead of a client.tool_runtime.rag_tool.insert command, there would be something like a client.vector_dbs.insert command for inserting into the vector DB that I just registered and then a client.tool_runtime.rag_tool command of some sort for pointing the RAG tool at that vector DB. With that said, given this code, I do kind of expect that an index with ID vector_db_id is created and persisted (and the documents are inserted into it) in whatever vector DB I have configured as my vector DB provider when I run this code. That could be an ephemeral thread-level database if I specified an ephemeral in-line vector DB provider, but if I configured a remote provider then that's what I would expect to be used here.

@ehhuang
Contributor

ehhuang commented Feb 27, 2025

Yea sorry I was referring to client.tool_runtime.rag_tool.insert to insert documents. So to rephrase,

  1. When documents is used with create_turn, an ephemeral db, associated with the session, is created and the documents are inserted there, independent of the vector_db_ids passed in with the rag_tool.
  2. If users want to insert documents into some persisted vector_db, they use client.tool_runtime.rag_tool.insert.
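
As a rough illustration of point 1, a session-scoped store with a TTL might look like the sketch below. It is hypothetical throughout: EphemeralSessionStores is not a Llama Stack class, and it keeps raw document strings as a stand-in for a real vector index with embeddings and similarity search.

```python
import time
from typing import Callable, Dict, List, Tuple


class EphemeralSessionStores:
    """Hypothetical registry of per-session document stores that expire after a TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # session_id -> (expiry timestamp, documents held for that session)
        self._stores: Dict[str, Tuple[float, List[str]]] = {}

    def add_documents(self, session_id: str, documents: List[str]) -> None:
        """Create the session store on first use and refresh its expiry."""
        self._evict_expired()
        _, existing = self._stores.get(session_id, (0.0, []))
        self._stores[session_id] = (self._clock() + self._ttl, existing + documents)

    def documents_for(self, session_id: str) -> List[str]:
        """Return the documents still held for a session, if it has not expired."""
        self._evict_expired()
        return list(self._stores.get(session_id, (0.0, []))[1])

    def _evict_expired(self) -> None:
        now = self._clock()
        expired = [sid for sid, (expires, _) in self._stores.items() if expires <= now]
        for sid in expired:
            del self._stores[sid]
```

The store is independent of any vector_db_ids passed to the RAG tool, which matches the separation described in the two points above.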

@dmartinol
Contributor Author

  1. When documents is used with create_turn, an ephemeral db, associated with the session, is created and the documents are inserted there, independent of the vector_db_ids passed in with the rag_tool.

Hey, thanks for your input!
I’m new to the project but AFAIK the docs suggest also using the agent to ingest documents into a (persistent) vector db.
Also, I’m a bit concerned about the “ephemeral” db, since we expect this db to support document embeddings and similarity search queries through the regular sequence of insert and query functions (which includes PDF handling). What provider do you have in mind?

Note that the initial problem tracked by the associated issue is that the current implementation cannot create a default db for the session when multiple providers are configured: let’s find a solution that does not cause the same issue again while attempting to change the behavior 😉

@dmartinol
Contributor Author

dmartinol commented Feb 28, 2025

Updated the PR to:

  • Manage two separate arguments in the RAG tool configuration:
        toolgroups=[
            {
                "name": "builtin::rag",
                "args": {
                  "vector_db_ids": [_DB_IDS_FOR_QUERY_PURPOSES_], # Optional
                  "documents_db_id": _DB_ID_FOR_INGESTION_PURPOSES_ # If provided, it's also used at query time
                },
            }
        ],
  • Remove vector database ID from the session info
  • Update sample code in RAG docs

@ehhuang
Contributor

ehhuang commented Feb 28, 2025

Note that the initial problem tracked by the associated issue is that the current implementation cannot create a default db for the session when multiple providers are configured: let’s find a solution that does not cause the same issue again while attempting to change the behavior 😉

Sorry I read into this issue and code in more detail (am pretty new to this project too). I realized that what I suggested above was already the current behavior, except that it broke (i.e. need to specify the provider_id). Can we just choose a provider_id from available ones arbitrarily? Alternatively, we choose one provider_id and throw an error if it's not available and when documents are provided.

Re. documents_db_id, thanks for putting up the solution. My concern here is the added complexity. My understanding of the point of the documents feature is convenience: instead of having to set up a vector db for a session and ingesting documents manually, all you need to do is attach documents to a message and the work would be done for you.

With documents_db_id, users need to set up the vector db and manage it with the session. All that documents does then is save the one line that inserts them into documents_db_id, compared to not using documents in the message. This no longer justifies the complexity of having to learn about this new concept IMO.

@dmartinol
Contributor Author

Sorry I read into this issue and code in more detail (am pretty new to this project too). I realized that what I suggested above was already the current behavior, except that it broke (i.e. need to specify the provider_id). Can we just choose a provider_id from available ones arbitrarily?

Being "arbitrary" was the first comment I received, which then started the journey about trying to review the API and its behavior, my fault.
I will get back to the first option, I also think it's the best thing we can do w/o altering the original behavior. Also, it's a bit of a corner case as I'm not expecting many real setups with multiple providers for vector DBs.

If we want to change the API and behavior, we'll track and discuss it with a separate issue.

Re. documents_db_id, thanks for putting up the solution. My concern here is the added complexity. My understanding of the point of the documents feature is convenience: instead of having to set up a vector db for a session and ingesting documents manually, all you need to do is attach documents to a message and the work would be done for you.

Agree: let's move on with the simpler solution then. It just has to be clear that this is a one-time consumption of these documents.

Signed-off-by: Daniele Martinoli <[email protected]>
…r an ephemeral vector db

Signed-off-by: Daniele Martinoli <[email protected]>
@dmartinol
Contributor Author

dmartinol commented Feb 28, 2025

  • Reverted all changes that modified the API behavior; the first configured provider is now used when no provider_id is specified.
  • Updated test_chat_agent to align with latest changes
  • Note some changes are needed in llama-stack-apps examples, as the RAG memory tool only accepts a tool_prompt_format="python_list" instead of tool_prompt_format="json". Created another PR for that: "fix: Fixing tool prompt format" (llama-stack-apps#196).
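
For readers skimming the thread, the fallback described in the first bullet amounts to something like the sketch below. It is not the merged code: choose_provider_id is a hypothetical helper, and raising an error for an unknown or missing provider is my assumption.

```python
from typing import Optional, Sequence


def choose_provider_id(
    available_provider_ids: Sequence[str],
    requested_provider_id: Optional[str] = None,
) -> str:
    """Pick the vector_io provider used for the session's default vector DB."""
    if requested_provider_id is not None:
        # An explicit choice always wins, as long as it is actually configured.
        if requested_provider_id not in available_provider_ids:
            raise ValueError(f"unknown provider_id: {requested_provider_id}")
        return requested_provider_id
    if not available_provider_ids:
        raise ValueError("no vector_io provider is configured")
    # No explicit choice: use the first configured provider.
    return available_provider_ids[0]
```

With a single provider this behaves exactly as before; only setups with several vector_io providers see the new "first one wins" rule.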

@ehhuang
Contributor

ehhuang commented Feb 28, 2025

LG. Thanks! @dmartinol

@dmartinol
Contributor Author

can we at least merge the changes to the broken UT?

@ehhuang ehhuang merged commit fb99868 into llamastack:main Mar 5, 2025
4 checks passed
Successfully merging this pull request may close these issues.

Agent ingestion fails with multiple vector_io providers
7 participants