Dynamically adding/removing workers #529

Open
conradoverta opened this issue Aug 17, 2023 · 2 comments

@conradoverta

Hi, folks!

I apologize if this has been asked before, but I searched through the issues and only found two tangential ones (#182 and #319).

I'm curious about the possibility of adding and removing workers from a dataflow cluster. More specifically, I'm thinking about running in the cloud, where we might lose workers because nodes get replaced, or add workers to scale up. I've checked the documentation, and it seems like the list of workers is static.

Is there any documentation about this? How would we manage to modify the cluster configuration?

Thanks!

@frankmcsherry
Member

Hello! It's a great question.

I think the short answer is "no". It is challenging to add and remove workers, as much of the shared state they use to coordinate relies on static assumptions, e.g. about their number. There was some work at ETHZ (@antiguru or @utaal: do either of you recall details?) on dynamic workers, I believe virtualizing the idea of a worker and allowing reconfiguration with some interruption. However, "losing" workers seems very problematic, as a lost worker is potentially indistinguishable from a very busy worker that still needs to be awaited.

What I might suggest instead is to think of the workers and their operators as async tasks that can themselves farm out work, potentially to an elastic set of cloud workers, leaving the operators with the role of confirming and locking in completed work. This gives you access to timely dataflow's progress tracking, while allowing you to step away from its "exactly this many workers" threadpool model. The workers could shuttle resulting data around, or could shuttle references to cloud storage (e.g. S3 blob names) that the cloud workers could then pick up.
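
Roughly the shape I have in mind, as a sketch rather than an existing API (the `S3Ref` type and `submit_to_cloud_pool` helper are made up): the timely workers move only lightweight references through the dataflow and use progress tracking to confirm completed work, while the heavy computation happens in an external, elastic pool.

```rust
use timely::dataflow::InputHandle;
use timely::dataflow::operators::{Input, Inspect, Map, Probe};

// Hypothetical: a lightweight pointer to bulk data living in object storage.
#[derive(Clone, Debug)]
struct S3Ref(String);

// Hypothetical stand-in for handing a unit of work to an elastic cloud pool
// and waiting until it reports the output blob it produced.
fn submit_to_cloud_pool(blob: &S3Ref) -> S3Ref {
    S3Ref(format!("{}-processed", blob.0))
}

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut input = InputHandle::new();
        let probe = worker.dataflow::<u64, _, _>(|scope| {
            scope
                .input_from(&mut input)
                // Only references flow through timely; the heavy work happens
                // behind `submit_to_cloud_pool`.
                .map(|blob: S3Ref| submit_to_cloud_pool(&blob))
                .inspect(|done| println!("locked in: {:?}", done))
                .probe()
        });

        // Feed blob references and use progress tracking to confirm each round.
        for round in 0..3u64 {
            input.send(S3Ref(format!("s3://bucket/input-{}", round)));
            input.advance_to(round + 1);
            worker.step_while(|| probe.less_than(input.time()));
        }
    })
    .unwrap();
}
```

The timely workers could just as well pass around actual results rather than references; the point is that the number of timely workers stays fixed while the pool behind `submit_to_cloud_pool` can grow and shrink.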

@conradoverta
Author

Thanks for the reply, Frank!

That's a very interesting idea! So basically timely handles the core consistency and work scheduling based on the specified flow, while the heavier state and work are farmed out elsewhere. This could make the timely clusters pretty thin, which makes failures (e.g. network partitions) much less likely.

I imagine that, as part of the dataflow, we could save the smaller timely-side state (e.g. S3 locations) fairly frequently and reload it at boot time to continue work after a failure. This could allow pretty ephemeral clusters that can be restored after "every" step.
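
Something like this is roughly what I'm imagining for the checkpoint itself, just a sketch with made-up names, and written to the local filesystem here where in practice it would go to durable storage like S3:

```rust
use serde::{Deserialize, Serialize};
use std::fs;

// The "small" timely-side state: which step completed and where the bulk
// outputs live in object storage.
#[derive(Serialize, Deserialize, Default)]
struct Checkpoint {
    completed_step: u64,
    blob_keys: Vec<String>,
}

fn save_checkpoint(path: &str, ckpt: &Checkpoint) -> std::io::Result<()> {
    fs::write(path, serde_json::to_vec(ckpt).expect("serializable"))
}

fn load_checkpoint(path: &str) -> Checkpoint {
    // A missing or unreadable checkpoint means a fresh start.
    fs::read(path)
        .ok()
        .and_then(|bytes| serde_json::from_slice(&bytes).ok())
        .unwrap_or_default()
}
```

At boot a fresh cluster would call `load_checkpoint`, resubmit anything after `completed_step`, and carry on.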

Is there somewhere (the timely docs or elsewhere) you'd recommend to help me think about state saving and resumption? I have a rough idea of what that would look like, but I'm sure there's a lot of nuance I haven't thought about yet.
