
Add reusable pipelines #38

Open
wants to merge 1 commit into master
Conversation

@hemajv (Collaborator) commented Jan 28, 2025

(Addresses #28)

This PR enables building reusable indexing and RAG pipelines.

Main Changes:

  • Updated rag.py to allow the run() function to execute the pipeline with required component arguments
  • Updated pipeline.py to allow the component arguments to be passed as a dictionary for the RAG pipeline
  • Updated api.py to introduce separate functions for building and executing the index/RAG pipelines
  • Updated test/sanity_test.py to demonstrate these changes

Testing the Code

  • You can run test/sanity_test.py to test all the newly implemented changes
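As a rough usage sketch of the build-then-run split this PR introduces: `create_index_pipeline()` and `run()` mirror the names from the diff, but everything else below (the stand-in class, the returned dict shape, the parameter names) is illustrative only, not the actual pragmatic implementation.

```python
# Minimal stand-in sketch of the reusable-pipeline API from this PR.
# Only create_index_pipeline() and run() come from the diff; the rest
# is a hypothetical placeholder.

class FakePipeline:
    """Stand-in for the pipeline wrapper returned by the builder."""

    def __init__(self, path, settings):
        self.path = path
        self.settings = settings

    def run(self, pipeline_args_dict=None):
        # A real pipeline would index documents from self.path here.
        return {"indexed_from": self.path, "args": pipeline_args_dict or {}}


def create_index_pipeline(path, **kwargs):
    # Build the pipeline once...
    return FakePipeline(path, kwargs)


# ...then reuse it across multiple runs with different arguments.
pipeline = create_index_pipeline("./docs", embedding_model="all-MiniLM-L6-v2")
first = pipeline.run()
second = pipeline.run({"batch_size": 16})
```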

@hemajv hemajv added the enhancement New feature or request label Jan 28, 2025
@ilya-kolchinsky (Collaborator) left a comment


Please fix the remarks.

@@ -2,18 +2,23 @@
 from pragmatic.pipelines.rag import RagPipelineWrapper
 from pragmatic.pipelines.utils import produce_custom_settings


-def index_path_for_rag(path, **kwargs):
+def create_index_pipeline(path, **kwargs):

At the API level, I believe we still want to have a unified "one-shot" method that handles both pipeline creation and execution. It will fit cases like RamaLama where simplicity is the top priority.
This is especially true for indexing, which is typically done in a single stage, as opposed to querying the model. But the same applies to execute_rag_query as well.
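The suggested one-shot convenience method could look roughly like the sketch below. The names index_path_for_rag and create_index_pipeline come from the diff; the pipeline internals are placeholders for the real pragmatic implementation.

```python
# Sketch of a unified "one-shot" API method layered on top of the
# split build/run functions, as suggested in the review above.
# Pipeline internals here are hypothetical placeholders.

def create_index_pipeline(path, **kwargs):
    # Placeholder builder standing in for the PR's real implementation.
    return {"path": path, "settings": kwargs}


def _run_pipeline(pipeline):
    # Placeholder executor; a real one would drive the Haystack pipeline.
    return {"indexed": pipeline["path"]}


def index_path_for_rag(path, **kwargs):
    """Create and immediately execute an indexing pipeline in one call,
    for callers (e.g. RamaLama) where simplicity is the top priority."""
    pipeline = create_index_pipeline(path, **kwargs)
    return _run_pipeline(pipeline)


result = index_path_for_rag("./docs")
```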

 1) Indexing mode (-i flag) - index a collection of documents from the given path.
 2) RAG query mode (-r flag) - answer a given query with RAG using the previously indexed documents.
 3) Evaluation mode (-e flag) - evaluate the RAG pipeline as specified in the settings - NOT YET OFFICIALLY SUPPORTED.
+1) Index pipeline creation mode (-ip flag) - create an indexing pipeline from the given path.

I'm really not sure these new modes should be exposed via main.py. What would a user do with a pipeline that is immediately destroyed after the program exits? IMO, this new functionality should be reserved for the API only. At some point in the future we might introduce a more advanced main.py that interacts with the user via the command line, similarly to ilab chat; then these new flags would make sense.
This is of course only my opinion; let's discuss if you think otherwise :)

-    def run(self):
-        logger.debug(f"Executing the pipeline with the following arguments:\n{self._args}")
-        return self._pipeline.run(self._args)
+    def run(self, pipeline_args_dict=None):

So at pipeline creation we provide the default values for the run, which are stored under self._args. Your new implementation keeps this self._args and uses either it or, alternatively, a dictionary provided as input. What if the caller can or wants to provide only a subset of the parameters? In that case, the two sources of parameters should be merged, and the merge result should be used as the input for the Haystack pipeline.
Another important point is whether we should keep the defaults provided at initialization. Wouldn't it be better to override the defaults (which will mostly be arbitrary anyway) with the new inputs? Perhaps add a Boolean parameter to run() to control that?
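The merge semantics proposed in this comment could be sketched as follows. The class name and the prefer_new flag are hypothetical, introduced here only to illustrate the two override directions being discussed.

```python
# Sketch of the proposed parameter-merge semantics for run().
# RagPipelineWrapperSketch and the prefer_new flag are hypothetical,
# not part of the actual pragmatic code.

class RagPipelineWrapperSketch:
    def __init__(self, default_args):
        self._args = dict(default_args)  # defaults captured at creation

    def run(self, pipeline_args_dict=None, prefer_new=True):
        provided = pipeline_args_dict or {}
        if prefer_new:
            # Run-time arguments override the (mostly arbitrary) defaults.
            merged = {**self._args, **provided}
        else:
            # Defaults win; run-time arguments only fill the gaps.
            merged = {**provided, **self._args}
        # A real implementation would pass `merged` to the Haystack pipeline.
        return merged


wrapper = RagPipelineWrapperSketch({"top_k": 5, "query": None})
# The caller supplies only a subset; the stored defaults fill in the rest.
merged = wrapper.run({"query": "What is RAG?"})
# → {'top_k': 5, 'query': 'What is RAG?'}
```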

return result["answer_builder"]["answers"][0]

return result["llm"]["replies"][0]
else: #not in eval mode

While this approach obviously works, it would be very challenging to maintain this code long-term. It basically hard-codes the structure of the pipeline; any attempt to introduce even a slightly different pipeline in the future will make this thing explode.
One way to resolve this would be to initialize the relevant parameters to None at pipeline creation (in all the places where you removed them), then rely on the code at the very beginning of the run() method to replace all the relevant fields with the up-to-date query. Please let me know if you'd like to handle it differently or if you have any concerns in this regard, and we will discuss.
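One possible shape for that None-then-replace approach is sketched below. The component and field names are invented for illustration; the real pragmatic pipeline layout may differ.

```python
# Sketch of initializing query-dependent fields to None at pipeline
# creation and patching them at the top of run(). Component names
# ("retriever", "prompt_builder") are illustrative assumptions.

class QueryPipelineSketch:
    def __init__(self):
        # Query-dependent inputs are deliberately left as None when the
        # pipeline is built, so no query is baked into its structure.
        self._args = {
            "retriever": {"query": None, "top_k": 5},
            "prompt_builder": {"query": None},
        }

    def run(self, query):
        # Generic pass at the start of run(): update every component that
        # declares a "query" input, without hard-coding the pipeline shape.
        for component_args in self._args.values():
            if "query" in component_args:
                component_args["query"] = query
        # A real wrapper would now call self._pipeline.run(self._args).
        return self._args


args = QueryPipelineSketch().run("What is RAG?")
```

The design keeps the pipeline structure out of the query-injection code: adding a new component with a "query" input requires no change to run().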
