
Add reusable pipelines #38

Open
wants to merge 1 commit into master
Conversation

@hemajv (Collaborator) commented Jan 28, 2025

(Addresses #28)

This PR enables building reusable indexing and RAG pipelines.

Main Changes:

  • Updated rag.py to allow the run() function to execute the pipeline with required component arguments
  • Updated pipeline.py to allow the component arguments to be passed as a dictionary for the RAG pipeline
  • Updated api.py to introduce separate functions for building and executing the index/RAG pipelines
  • Updated test/sanity_test.py to demonstrate these changes

Testing the Code

  • You can run test/sanity_test.py to test all the newly implemented changes
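As a rough usage sketch of the build-then-run split this PR introduces: `create_index_pipeline()` and `run()` mirror the names from the diff, but everything else below (the stand-in class, the returned dict shape, the parameter names) is illustrative only, not the actual pragmatic implementation.

```python
# Minimal stand-in sketch of the reusable-pipeline API from this PR.
# Only create_index_pipeline() and run() come from the diff; the rest
# is a hypothetical placeholder.

class FakePipeline:
    """Stand-in for the pipeline wrapper returned by the builder."""

    def __init__(self, path, settings):
        self.path = path
        self.settings = settings

    def run(self, pipeline_args_dict=None):
        # A real pipeline would index documents from self.path here.
        return {"indexed_from": self.path, "args": pipeline_args_dict or {}}


def create_index_pipeline(path, **kwargs):
    # Build the pipeline once...
    return FakePipeline(path, kwargs)


# ...then reuse it across multiple runs with different arguments.
pipeline = create_index_pipeline("./docs", embedding_model="all-MiniLM-L6-v2")
first = pipeline.run()
second = pipeline.run({"batch_size": 16})
```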

@hemajv hemajv added the enhancement New feature or request label Jan 28, 2025
@ilya-kolchinsky (Collaborator) left a comment


Please fix the remarks.

@@ -2,18 +2,23 @@
 from pragmatic.pipelines.rag import RagPipelineWrapper
 from pragmatic.pipelines.utils import produce_custom_settings


-def index_path_for_rag(path, **kwargs):
+def create_index_pipeline(path, **kwargs):

At the API level, I believe we still want to have a unified "one-shot" method that handles both pipeline creation and execution. It will fit cases like RamaLama where simplicity is the top priority.
This is especially true for indexing, which is typically done in a single stage, as opposed to querying the model. But the same applies to execute_rag_query as well.
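The suggested one-shot convenience method could look roughly like the sketch below. The names index_path_for_rag and create_index_pipeline come from the diff; the pipeline internals are placeholders for the real pragmatic implementation.

```python
# Sketch of a unified "one-shot" API method layered on top of the
# split build/run functions, as suggested in the review above.
# Pipeline internals here are hypothetical placeholders.

def create_index_pipeline(path, **kwargs):
    # Placeholder builder standing in for the PR's real implementation.
    return {"path": path, "settings": kwargs}


def _run_pipeline(pipeline):
    # Placeholder executor; a real one would drive the Haystack pipeline.
    return {"indexed": pipeline["path"]}


def index_path_for_rag(path, **kwargs):
    """Create and immediately execute an indexing pipeline in one call,
    for callers (e.g. RamaLama) where simplicity is the top priority."""
    pipeline = create_index_pipeline(path, **kwargs)
    return _run_pipeline(pipeline)


result = index_path_for_rag("./docs")
```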

 1) Indexing mode (-i flag) - index a collection of documents from the given path.
 2) RAG query mode (-r flag) - answer a given query with RAG using the previously indexed documents.
 3) Evaluation mode (-e flag) - evaluate the RAG pipeline as specified in the settings - NOT YET OFFICIALLY SUPPORTED.
+1) Index pipeline creation mode (-ip flag) - create an indexing pipeline from the given path.

I'm really not sure these new modes should be exposed via main.py. What would a user do with a pipeline that is immediately destroyed after the program exits? IMO, this new functionality should be reserved for the API only. At some point in the future we might introduce a more advanced main.py that interacts with the user via the command line, similarly to ilab chat; then these new flags would make sense.
This is of course only my opinion; let's discuss if you think otherwise :)

-    def run(self):
-        logger.debug(f"Executing the pipeline with the following arguments:\n{self._args}")
-        return self._pipeline.run(self._args)
+    def run(self, pipeline_args_dict=None):

So at pipeline creation we provide the default values for the run, which are stored under self._args. Your new implementation keeps this self._args and uses either it or, alternatively, a dictionary provided as input. What if the caller can or wants to provide only a subset of the parameters? In that case, the two sources of parameters should be merged, and the merge result should be used as the input for the Haystack pipeline.
Another important point is whether we should keep the defaults provided at initialization. Wouldn't it be better to override the defaults (which will mostly be arbitrary anyway) with the new inputs? Perhaps add a Boolean parameter to run() to control that?
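The merge semantics proposed in this comment could be sketched as follows. The class name and the prefer_new flag are hypothetical, introduced here only to illustrate the two override directions being discussed.

```python
# Sketch of the proposed parameter-merge semantics for run().
# RagPipelineWrapperSketch and the prefer_new flag are hypothetical,
# not part of the actual pragmatic code.

class RagPipelineWrapperSketch:
    def __init__(self, default_args):
        self._args = dict(default_args)  # defaults captured at creation

    def run(self, pipeline_args_dict=None, prefer_new=True):
        provided = pipeline_args_dict or {}
        if prefer_new:
            # Run-time arguments override the (mostly arbitrary) defaults.
            merged = {**self._args, **provided}
        else:
            # Defaults win; run-time arguments only fill the gaps.
            merged = {**provided, **self._args}
        # A real implementation would pass `merged` to the Haystack pipeline.
        return merged


wrapper = RagPipelineWrapperSketch({"top_k": 5, "query": None})
# The caller supplies only a subset; the stored defaults fill in the rest.
merged = wrapper.run({"query": "What is RAG?"})
# → {'top_k': 5, 'query': 'What is RAG?'}
```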

return result["answer_builder"]["answers"][0]

return result["llm"]["replies"][0]
else: #not in eval mode

While this approach obviously works, it would be very challenging to maintain this code long-term. It basically hard-codes the structure of the pipeline; any attempt to introduce even a slightly different pipeline in the future will make this thing explode.
One way to resolve this would be to initialize the relevant parameters to None at pipeline creation (in all the places where you removed them), then rely on the code at the very beginning of the run() method to replace all the relevant fields with the up-to-date query. Please let me know if you'd like to handle it differently or if you have any concerns in this regard, and we will discuss.
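One possible shape for that None-then-replace approach is sketched below. The component and field names are invented for illustration; the real pragmatic pipeline layout may differ.

```python
# Sketch of initializing query-dependent fields to None at pipeline
# creation and patching them at the top of run(). Component names
# ("retriever", "prompt_builder") are illustrative assumptions.

class QueryPipelineSketch:
    def __init__(self):
        # Query-dependent inputs are deliberately left as None when the
        # pipeline is built, so no query is baked into its structure.
        self._args = {
            "retriever": {"query": None, "top_k": 5},
            "prompt_builder": {"query": None},
        }

    def run(self, query):
        # Generic pass at the start of run(): update every component that
        # declares a "query" input, without hard-coding the pipeline shape.
        for component_args in self._args.values():
            if "query" in component_args:
                component_args["query"] = query
        # A real wrapper would now call self._pipeline.run(self._args).
        return self._args


args = QueryPipelineSketch().run("What is RAG?")
```

The design keeps the pipeline structure out of the query-injection code: adding a new component with a "query" input requires no change to run().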
