Fix retries if the failure is due to missed heartbeats. #451

Merged: 2 commits, Jun 26, 2020


goyalankit
Contributor

During a retry, the missed-heartbeat variable is not reset. If the previous attempt failed due to missed heartbeats, TonY therefore keeps failing on every subsequent retry, because the variable still holds the value from the earlier failure.

Here's an example log. All 10 retries failed quickly with the same error:

$ cat t | grep 'ERROR' | grep 'Application failed'
2020-06-26 08:44:27 ERROR ApplicationMaster:598 - Application failed due to missed heartbeats
2020-06-26 08:44:52 ERROR ApplicationMaster:598 - Application failed due to missed heartbeats
2020-06-26 08:44:53 ERROR ApplicationMaster:598 - Application failed due to missed heartbeats
2020-06-26 08:44:56 ERROR ApplicationMaster:598 - Application failed due to missed heartbeats
2020-06-26 08:44:57 ERROR ApplicationMaster:598 - Application failed due to missed heartbeats
2020-06-26 08:44:58 ERROR ApplicationMaster:598 - Application failed due to missed heartbeats
2020-06-26 08:45:00 ERROR ApplicationMaster:598 - Application failed due to missed heartbeats
2020-06-26 08:45:02 ERROR ApplicationMaster:598 - Application failed due to missed heartbeats
2020-06-26 08:45:03 ERROR ApplicationMaster:598 - Application failed due to missed heartbeats
2020-06-26 08:45:05 ERROR ApplicationMaster:598 - Application failed due to missed heartbeats
2020-06-26 08:45:07 ERROR ApplicationMaster:598 - Application failed due to missed heartbeats
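
The failure mode described above can be illustrated with a minimal, self-contained sketch. This is not the TonY source: the class `HeartbeatRetrySketch`, the field `missedHeartbeat`, and the methods `onMissedHeartbeats`, `resetForNewAttempt`, and `run` are all hypothetical names chosen for illustration. The sketch only assumes what the description states, namely that a flag set by a heartbeat failure survives into the next attempt unless it is explicitly reset.

```java
// Hypothetical sketch; none of these names are taken from the TonY source.
final class HeartbeatRetrySketch {
    // Set when a task misses too many heartbeats during an attempt.
    private boolean missedHeartbeat = false;

    // Hypothetical callback that a heartbeat monitor would invoke.
    void onMissedHeartbeats() {
        missedHeartbeat = true;
    }

    // Per-attempt state reset. The fix amounts to clearing the flag here;
    // resetFlag == false models the behavior before the fix.
    private void resetForNewAttempt(boolean resetFlag) {
        if (resetFlag) {
            missedHeartbeat = false;
        }
    }

    // Runs up to maxAttempts attempts. The first failingAttempts attempts
    // genuinely miss heartbeats; later attempts are healthy.
    // Returns the 1-based number of the attempt that succeeded, or -1.
    int run(int maxAttempts, boolean resetFlag, int failingAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            resetForNewAttempt(resetFlag);
            if (attempt <= failingAttempts) {
                onMissedHeartbeats();
            }
            if (!missedHeartbeat) {
                return attempt;
            }
            // A stale flag makes a healthy attempt look like a failure.
        }
        return -1;
    }

    public static void main(String[] args) {
        // Without the reset, one real failure poisons every retry.
        System.out.println(new HeartbeatRetrySketch().run(11, false, 1)); // prints -1
        // With the reset, the second attempt succeeds.
        System.out.println(new HeartbeatRetrySketch().run(11, true, 1));  // prints 2
    }
}
```

In this sketch, the stale flag explains why the retries in the log fail within a second or two of each other: under this assumption, each retry is marked as failed without any new heartbeat actually being missed.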

Member

@oliverhu left a comment

please create another issue to add tests

@goyalankit changed the title from "Fix retries if the failures is due to missed heartbeats." to "Fix retries if the failure is due to missed heartbeats." on Jun 26, 2020
@goyalankit
Contributor Author

please create another issue to add tests

#452

@goyalankit merged commit 4af32a1 into master on Jun 26, 2020
@goyalankit deleted the retry-fix branch on June 26, 2020 21:12
2 participants