Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Simplify retry behaviour and unify it between plugins #3902

Merged

Conversation

fg91
Copy link
Member

@fg91 fg91 commented Jul 27, 2023

RFC for RFC incubator discussion #3793

@fg91 fg91 added the rfc A label for RFC issues label Jul 27, 2023
@fg91 fg91 requested a review from hamersaw July 27, 2023 16:31
@fg91 fg91 force-pushed the fg91/rfc/simplify-retry-behaviour branch from f619e91 to 677c858 Compare July 27, 2023 16:32
rfc/system/0000-simplify-retry-behaviour.md Outdated Show resolved Hide resolved
rfc/system/0000-simplify-retry-behaviour.md Outdated Show resolved Hide resolved
rfc/system/0000-simplify-retry-behaviour.md Outdated Show resolved Hide resolved
rfc/system/0000-simplify-retry-behaviour.md Outdated Show resolved Hide resolved
rfc/system/0000-simplify-retry-behaviour.md Outdated Show resolved Hide resolved
rfc/system/0000-simplify-retry-behaviour.md Outdated Show resolved Hide resolved
rfc/system/0000-simplify-retry-behaviour.md Outdated Show resolved Hide resolved
rfc/system/0000-simplify-retry-behaviour.md Outdated Show resolved Hide resolved
rfc/system/0000-simplify-retry-behaviour.md Outdated Show resolved Hide resolved
rfc/system/0000-simplify-retry-behaviour.md Outdated Show resolved Hide resolved
@davidmirror-ops
Copy link
Contributor

2023-08-17: Contributor's meeting notes: more discussion is needed to decide if an update to this proposal is needed. There are open questions. Eduardo to ask Union contributors to comment here.

@fg91
Copy link
Member Author

fg91 commented Aug 23, 2023

@fellhorn and I tried out the WIP implementation of this RFC by @hamersaw on this branch:

https://github.com/flyteorg/flytepropeller/tree/feature/simplify-retries

We ran the following experiments with both the old and the new behaviour:

Test Old code New code Error type Comment
@task(retries=2)
def train():
raise Exception("No way back from this")
No retry No retry User Non-retriable exceptions stay non-retriable.
@task(retries=2)
def train():
raise FlyteRecoverableException("Please retry”)
2 retries 2 retries User The handling of recoverable user errors stays the same.
@torch.inference_mode()
@task( retries=2, )
def train():
2 retries 2 retries User One of our users recently wrapped the task decorator in another decorator which causes the entrypoint to fail. Against expectations this is handled as a user budget retry but the behaviour is not changed.
@torch.inference_mode()
@task(retries=2, task_config=Elastic(nnodes=2))
def train(): …
2 retries 2 retries User Same unchanged behaviour e.g. also for the kubeflow plugins
@task(retries=2, interruptible=True)
def train():
raise FlyteRecoverableException(”Please retry”)
2 retries, all interruptible 2 retries, all interruptible (last retry should be non-interruptible) User As explained in the RFC, we think it would be great if also user budget retries switched to non interruptible nodes for the last (n) retry. This is currently not the case but we implemented this functionality in a PR.
@task(retries=1)
# Delete the Pod
2 retries 1 retry System System retries are now counted against the user defined, unified retry budget as proposed by the RFC.
@task(retries=1,task_config=Elastic())
# Delete the pod
1 retry 1 retry User With the old behaviour, we see a difference compared to the line above. With the new behaviour, the UX is consistent between the plugin and normal python function task, even though internally in this case we count it as a user retry, not a system retry. It behaves as proposed in the RFC as the user doesn't see a difference.
@task(retries=4, interruptible=True)
def train():
raise FlyteRecoverableException(”Please retry”)
4 retries but all interruptible # But with our own interruptible logic 4 retries (3 of them interruptible) User  

TL;DR:
The WIP implementation removes the inconsistencies in retry behaviour between python function tasks and e.g. the kubeflow plugins and the user is able to control the number of preemptions (simulated by pod deletions) as outlined in the RFC. This makes everything easier to understand.

In @hamersaw's branch, the last retry is not yet switched to non-interruptible as this RFC proposes but we added this functionality this PR:
flyteorg/flytepropeller#610

Fabio Grätz and others added 6 commits September 10, 2023 11:18
Co-authored-by: Dennis Keck <[email protected]>
Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>
Co-authored-by: Dennis Keck <[email protected]>
Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>
Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>
Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>
@fg91 fg91 force-pushed the fg91/rfc/simplify-retry-behaviour branch from 8a9409a to 0c818a6 Compare September 10, 2023 09:18
fg91 added a commit that referenced this pull request Sep 10, 2023
fg91 added a commit that referenced this pull request Sep 10, 2023
Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>
@fg91
Copy link
Member Author

fg91 commented Sep 10, 2023

@davidmirror-ops Is the location proposed in this PR a good place to document that this RFC/behaviour exists?

@davidmirror-ops
Copy link
Contributor

@davidmirror-ops Is the location proposed in this PR a good place to document that this RFC/behaviour exists?

@fg91 Yes, I think it fits well. Thank you!

pingsutw
pingsutw previously approved these changes Sep 26, 2023
@fg91
Copy link
Member Author

fg91 commented Sep 27, 2023

@fellhorn and I opened a PR with some finishing touches into @hamersaw 's PR, see flyteorg/flytepropeller#623. As far as we can see, this completes the implementation of this RFC.

If the others in the steering committee also approve, from my side this is ready @davidmirror-ops .

Copy link
Member

@bstadlbauer bstadlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you!

@davidmirror-ops davidmirror-ops merged commit 0f2b71f into flyteorg:master Sep 28, 2023
10 checks passed
fg91 added a commit that referenced this pull request Oct 30, 2023
Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>
fg91 added a commit that referenced this pull request Oct 30, 2023
* Document simplified retry behaviour introduced in #3902

Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>

* Update rsts/concepts/tasks.rst

Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>

---------

Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>
@fg91
Copy link
Member Author

fg91 commented Dec 5, 2023

For documentation purposes:

In the RFC we added the following action item:

Add configuration for default number of retries in FlytePropeller. Currently the default maximum number of attempts is hardcoded. This should be exposed in the configuration.

In the propeller config, one can in fact set:

core:
  propeller:
      node-config:
        ignore-retry-cause: true
        default-max-attempts: 3

If the user now doesn't explicitly specify the number of retries in the task decorator, it defaults to (in this case) 3 attempts/2 retries.


This works regardless of whether one uses ignore-retry-cause: false or the new ignore-retry-cause: true behaviour which was proposed in this RFC. But for the default behaviour there is less need to configure default-max-attempts.


The compromise here is that if default-max-attempts > 1 in the platform config, it is not possible for users to turn off retries:

The retries kwarg in the @task decorator has a default value of 0, not None.
Flyteadmin/propeller, thus, have no way of knowing whether the user didn't specify the number of retries or explicitly set them to 0. So, in both cases, if the user doesn't specify the number of retries or sets them to 0, the default-max-attempts configured on the platform side will be used.

In theory, it would be nice if retries defaulted to None in the task decorator as then one could distinguish between "user didn't set retries at all, let's use the default value" vs "user explicitly set them to 0, let's not retry" in the backend.

In practice, this doesn't matter to us as we always want to retry and if something wasn't allowed to run more than once, we would instead use "some kind of external semaphore" anyways.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfc A label for RFC issues
Projects
Status: Implemented
Development

Successfully merging this pull request may close these issues.

6 participants