[Workflow API]: Proposal on refactoring FLSpec and Runtime to enhance modularity #1317

ishant162 · 2025-01-28T08:32:04Z

ishant162
Jan 28, 2025
Collaborator

SUMMARY

One of the main goals of the Workflow API is to clearly separate the Workflow definition from the Runtime infrastructure. While the current implementation is effective, FLSpec and Runtime remain coupled, which leaves room for improvement.

This proposal refines the interaction between FLSpec and the Runtime classes, such as LocalRuntime and FederatedRuntime, to make the design cleaner and more modular.

MOTIVATION

(Current) High Level Design:

In the example shown above, both LocalRuntime and FederatedRuntime are directly assigned to FLSpec Instance illustrating how the runtime is integrated (and coupled) with the flow.

From the above illustration we can see that runtime is an attribute of FLSpec. When flflow.run() is called, the execution flows to the associated runtime, whether it is LocalRuntime or FederatedRuntime.
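
The coupling described above can be illustrated with a toy sketch. The classes below are simplified stand-ins written for this discussion, not the actual OpenFL implementation; `ToyFLSpec`, `ToyLocalRuntime`, and `execute` are hypothetical names.

```python
class ToyLocalRuntime:
    """Hypothetical stand-in for a runtime; not the OpenFL class."""

    def execute(self, flow):
        # Run every step of the flow in order and collect the outputs.
        return [step() for step in flow.steps]


class ToyFLSpec:
    """Hypothetical stand-in for FLSpec that owns its runtime."""

    def __init__(self, steps):
        self.steps = steps
        self.runtime = None  # the runtime is an attribute of the flow

    def run(self):
        # The flow has to know how to dispatch to its runtime, so any new
        # runtime capability (async, reconnect, ...) leaks into this class.
        if self.runtime is None:
            raise RuntimeError("assign a runtime before calling run()")
        return self.runtime.execute(self)


flflow = ToyFLSpec(steps=[lambda: "start", lambda: "train", lambda: "end"])
flflow.runtime = ToyLocalRuntime()
result = flflow.run()  # blocks until all steps have finished
```

In this shape, the flow class is the entry point for execution, which is why runtime features tend to require changes to it.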

Potential issues with Current Approach:

Any future changes or enhancements to Runtime may need changes to FLSpec

Currently, flflow.run() is a blocking call, which prevents users from querying the experiment status or performing other actions while the experiment is running. This behavior could be changed by passing an argument that selects sync or async execution, but doing so requires changes to FLSpec, which ideally should not be the case.
Here are some possible Runtime-related features for which FLSpec might need to be modified:
- Async execution of a flow would require FLSpec to expose further APIs for users to query the status, retrieve the results, and clean up experiment data.
- For long-running experiments, the ability to disconnect from the Federation and reconnect may be needed.
- An API to enqueue multiple experiments on the Federated Infrastructure (this also depends on Director support).
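
To make the async case concrete, here is a minimal sketch of a non-blocking run that lives entirely in the runtime. Everything in it is an assumption made for illustration: `ToyAsyncRuntime`, `ExperimentHandle`, `run_async`, `status`, and `result` are hypothetical names, and a single-worker thread pool stands in for the federated infrastructure.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


class ExperimentHandle:
    """Hypothetical handle returned by a non-blocking run."""

    def __init__(self, future: Future):
        self._future = future

    def status(self) -> str:
        # Report coarse progress without blocking the caller.
        if self._future.done():
            return "finished"
        return "running" if self._future.running() else "pending"

    def result(self, timeout=None):
        # Block (optionally with a timeout) until the flow has finished.
        return self._future.result(timeout=timeout)


class ToyAsyncRuntime:
    """Hypothetical runtime that owns scheduling; the flow stays unchanged."""

    def __init__(self):
        # One worker: flows submitted while another is running are queued.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def run_async(self, flow) -> ExperimentHandle:
        return ExperimentHandle(self._pool.submit(flow))

    def shutdown(self):
        self._pool.shutdown(wait=True)


def toy_flow():
    time.sleep(0.05)  # stand-in for a long-running experiment
    return "done"


runtime = ToyAsyncRuntime()
handle = runtime.run_async(toy_flow)  # returns immediately
final = handle.result(timeout=5)      # wait for completion when needed
final_status = handle.status()
runtime.shutdown()
```

The point of the sketch is that status queries, result retrieval, and queuing can all be added without touching the flow definition.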

PROPOSED APPROACH

(Proposed) High Level Design: To decouple FLSpec from Runtime, we propose:

Introducing a run() method within the runtime.
Instead of binding a runtime instance to the FLSpec, the runtime’s run method will accept the FLSpec instance as an argument.

This approach shall achieve a clean separation of Workflow and Runtimes, aligning with the principles of a well-defined Workflow API. With this approach it should be possible to enhance Runtime without impacting FLSpec.
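
A minimal sketch of the proposed shape, under stated assumptions: the classes are toy stand-ins rather than the OpenFL classes, and all `Toy*` names, the `collaborators` parameter, and the site names are hypothetical.

```python
from abc import ABC, abstractmethod


class ToyFLSpec:
    """Hypothetical flow: only describes steps; no run() and no runtime."""

    def __init__(self, steps):
        self.steps = steps


class ToyRuntime(ABC):
    """Hypothetical runtime base class that owns execution."""

    @abstractmethod
    def run(self, flow: ToyFLSpec):
        ...


class ToyLocalRuntime(ToyRuntime):
    def run(self, flow: ToyFLSpec):
        # Execute the steps in-process, in order.
        return [step() for step in flow.steps]


class ToyFederatedRuntime(ToyRuntime):
    def __init__(self, collaborators):
        self.collaborators = collaborators

    def run(self, flow: ToyFLSpec):
        # Stand-in for remote execution: run the flow once per collaborator.
        return {
            name: [step() for step in flow.steps]
            for name in self.collaborators
        }


flflow = ToyFLSpec(steps=[lambda: "start", lambda: "end"])
local_result = ToyLocalRuntime().run(flflow)
federated_result = ToyFederatedRuntime(["site_a", "site_b"]).run(flflow)
```

Here the same flow object is handed to two different runtimes without being modified, which is the separation the proposal is after.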

TECHNICAL DETAILS

API Changes:
- The run() method will be added to LocalRuntime and FederatedRuntime to centralize execution logic within the runtime infrastructure.
- The run() method will be removed from FLSpec to streamline its responsibilities and improve the separation between Workflow definition and Workflow execution.
- Deprecation of run_local() and run_federated() in FLSpec: These methods will also be removed, with their functionality now managed by the respective runtime classes.
Backward Compatibility: This change is NOT backward compatible. Existing workflows that use flflow.run() will need to be updated to use the new run() method of the appropriate runtime class.
Dependencies: None

KEY BENEFITS

Decouples FLSpec from Runtime for improved flexibility.
Promotes a modular and maintainable design.
Provides a clear boundary, allowing Runtime to evolve independently of FLSpec.

RISKS

Users familiar with the existing API may require some time and guidance to adapt to the new approach.
Existing tutorials would need to be modified and revalidated.
Potential introduction of regressions during the refactoring process.

MITIGATION

Documentation and Tutorials shall provide examples to help users become familiar with the new approach.
Focused reviews and existing test cases can be used to avoid regressions.

NEXT STEPS

Prepare detailed design for Runtime and message flow for further alignment.

teoparvanov · 2025-01-29T13:31:25Z

teoparvanov
Jan 29, 2025
Maintainer

This is great, I really like this approach very much, @ishant162! The proposed API is more intuitive than the current one IMO - similar to executing a program (the FLSpec) by a computer (the *Runtime).

I'm wondering though if there isn't a subtle aspect/intention of the original design that we could be missing? I remember @psfoley had some concerns with this approach, so hopefully he can provide additional context and insight.

0 replies

porteratzo · 2025-01-30T15:06:18Z

porteratzo
Jan 30, 2025
Maintainer

This is definitely a welcome change. I made a diagram in drawio of the current implementation of the Workflow API a while back; I'll post it here in case it helps. https://drive.google.com/file/d/171yRMKWJceQIhHI4AOXdda5RmJ5n3igX/view?usp=sharing

0 replies

psfoley · 2025-02-05T05:43:25Z

psfoley
Feb 5, 2025
Maintainer

I don't disagree with the technical merits of the proposal - and I think this does lead to a cleaner separation of runtime and workload (from a class hierarchy perspective). My original reservation was centered around making sure we can cleanly express the chaining of workflows together across different environments. There are all kinds of patterns emerging in federated learning that involve dependencies between distinct workflows run on different infrastructure:

Centralized model pretraining followed by federated finetuning
Using the results of one federation as the basis for a new federation
Federated training followed by federated evaluation
...

But after thinking through this further and creating the following example that compares the two implementations, I think there's good reason to move forward with this refactor:

model_pretraining = ModelPretrainingFlow(...)
model_pretraining.runtime = LocalRuntime(...)

federated_finetuning = FederatedFinetuningFlow(...)
federated_finetuning.runtime = FederatedRuntime(...)

# Admittedly kind of ugly
federated_finetuning(model_pretraining.run()).run()

versus the proposal:

model_pretraining = ModelPretrainingFlow(...)
local_runtime = LocalRuntime(...)

federated_finetuning = FederatedFinetuningFlow(...)
federated_runtime = FederatedRuntime([small set of collaborators])

federated_evaluation = FederatedEvaluation(...)
secure_federated_runtime = SecureFederatedRuntime([collaborators across large hospital network])

local_runtime.run(model_pretraining)
federated_runtime.run(federated_finetuning(model_pretraining.model))
secure_federated_runtime.run(federated_evaluation(federated_finetuning.model))
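
For readers who want to try the chaining pattern, the following is a runnable toy version of the proposed style. It is only a sketch: `ToyFlow`, `ToyRuntime`, and the `label` argument are hypothetical, the "model" is just a list recording which stages ran where, and the upstream model is passed through the constructor rather than by calling the flow instance as in the pseudocode above.

```python
class ToyFlow:
    """Hypothetical flow that turns an input model into an output model."""

    def __init__(self, name, model=None):
        self.name = name
        self.model = model  # input before the run, output after it


class ToyRuntime:
    """Hypothetical runtime; `label` only records where the flow ran."""

    def __init__(self, label):
        self.label = label

    def run(self, flow):
        # Copy the incoming history, append a record of this stage, and
        # store it back on the flow so a downstream flow can pick it up.
        history = list(flow.model or [])
        history.append(f"{flow.name}@{self.label}")
        flow.model = history
        return flow


local_runtime = ToyRuntime("local")
federated_runtime = ToyRuntime("federated")
secure_federated_runtime = ToyRuntime("secure-federated")

model_pretraining = ToyFlow("pretraining")
local_runtime.run(model_pretraining)

federated_finetuning = ToyFlow("finetuning", model=model_pretraining.model)
federated_runtime.run(federated_finetuning)

federated_evaluation = ToyFlow("evaluation", model=federated_finetuning.model)
secure_federated_runtime.run(federated_evaluation)
```

Each stage reads the previous flow's `model` attribute, so the dependency between workflows stays explicit while the choice of runtime remains independent of the flow.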

0 replies

ishant162 · 2025-02-06T08:57:48Z

ishant162
Feb 6, 2025
Collaborator Author

In our review meeting, a detailed walkthrough of the proposal was provided, highlighting its key aspects, benefits, and shortcomings. The team extensively discussed its value moving forward, focusing on key improvements such as:

Establishing a clearer distinction between the FL spec and the runtime, allowing runtime enhancements without affecting the FL spec.
Laying the groundwork for future asynchronous workflow execution.
Introducing API changes early (preferably in the 1.8 timeframe) to minimize expensive refactorings, especially once the Workflow API moves beyond the experimental phase.

Everyone shared their feedback, leading to an in-depth discussion.

Final Decision:
The proposal was well received, and the team agreed to take it up for implementation.

0 replies
