Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch "crashed" tasks on task-level retry #7601

Closed
3 tasks done
trahloff opened this issue Nov 21, 2022 · 7 comments
Closed
3 tasks done

Catch "crashed" tasks on task-level retry #7601

trahloff opened this issue Nov 21, 2022 · 7 comments
Labels
enhancement An improvement of an existing feature

Comments

@trahloff
Copy link

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar request and didn't find it.
  • I searched the Prefect documentation for this feature.

Prefect Version

2.x

Describe the current behavior

A task-level retry will retry the task if it fails but not if it crashes.

Describe the proposed behavior

A task-level retry will retry the task if it fails and when it crashes.

Example Use

It can happen that some tasks crash because of exceptions within the Prefect client-side SDK that would be mitigated by a retry. This scenario fails the whole flow (interestingly in a failed state and not crashed) even if 1000 tasks succeed and 1 crashes.

For example, running many tasks in parallel can occasionally lead to the following exceptions:
httpcore.LocalProtocolError: Invalid input ConnectionInputs.SEND_HEADERS in state ConnectionState.CLOSED and httpcore.LocalProtocolError: Invalid input ConnectionInputs.RECV_PING in state ConnectionState.CLOSED.

It is often enough to wait a couple of minutes and run the same task again.

Additional context

No response

@trahloff trahloff added enhancement An improvement of an existing feature status:triage labels Nov 21, 2022
@zanieb
Copy link
Contributor

zanieb commented Nov 21, 2022

Hi @trahloff — I do not think we should retry on crashes. Crashes indicate infrastructure failure or failure of the Prefect engine itself. These errors are either BaseExceptions or occur outside of your code. We cannot retry in the same way, since the scope where the error is encountered is much wider than the try/except we use around execution of your code. Instead, we are intending to address all cases where the Prefect engine is failing (e.g. as you've referenced with the httpcore errors).

Thanks for the issue though, happy to chat further about it here.

@trahloff
Copy link
Author

Hi @madkinsz, thanks a lot for the context and lightning-fast response! Your perspective makes total sense.

Are you aware of any workaround to mitigate the Prefect engine errors or make the flow fail gracefully in the meantime?

@zanieb
Copy link
Contributor

zanieb commented Nov 21, 2022

We're adding client-level retries on those HTTP errors in #7593. We think those are upstream bugs, but we're going to deliver retries as a workaround until we can find the root cause and contribute a fix to httpcore. See #7442 for more details on that issue.

What would be the ideal way for the flow to fail gracefully?

@trahloff
Copy link
Author

Nice, great to see that the root cause is already identified and addressed!

What would be the ideal way for the flow to fail gracefully?

Now that I try to really pinpoint the requirement, I realize that "failing gracefully" might not even be the right term. Prefect gracefully fails the flow and isolates the crash right now.

From my perspective, it would be extremely helpful to fail under these conditions and include more hints for different crash types if this is realistically doable.
For example, this could be a log like It appears like the underlying infrastructure of the task run $TASK_RUN stopped unexpectedly or The Prefect engine ran into an unhandled error.

What do you think? Would that make sense within the current Prefect setup?

@zanieb
Copy link
Contributor

zanieb commented Nov 22, 2022

Yeah we could improve that. We are already roughly doing that by changing the messaging based on the exception type at https://github.com/PrefectHQ/prefect/blob/main/src/prefect/states.py#L112

@github-actions
Copy link
Contributor

This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.

@github-actions
Copy link
Contributor

This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add feel free to re-open it.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

2 participants