-
Notifications
You must be signed in to change notification settings - Fork 9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
YARN-11709. NodeManager should be shut down or blacklisted when it ca… #6960
Conversation
Thanks @ferdelyi for the change, seems good to me. |
💔 -1 overall
This message was automatically generated. |
LOG.error("Unrecoverable issue occurred. Marking the node as unhealthy to prevent " | ||
+ "further containers to get scheduled on the node and cause application failures. " + | ||
"Exit code from the container " + locId + "startLocalizer is : " + exitCode, e); | ||
nmContext.getNodeStatusUpdater().reportException(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can execute container-executor, do we need to restart NodeManager?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zeekling thank you for looking into this change! Yes, when we hit an unrecoverable issue with the NM, the root cause needs to be fixed and the NM manually restarted. This way the RM will not schedule applications to the node while the issue is present. When we let the RM to place containers to the faulty NM, it can lead to application failures. E.g. by reaching maximum number of application attempts when the AM was scheduled to the same node twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if it could automatically recover when container-executor is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zeekling thank you for your suggestion. To keep this change simple and easy to follow, I've filed your improvement idea under: YARN-11715
NodeManager should recover by itself once the container-executor can run program again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
…nnot run program /var/lib/yarn-ce/bin/container-executor
💔 -1 overall
This message was automatically generated. |
7031ee0
to
e6a1038
Compare
🎊 +1 overall
This message was automatically generated. |
…nnot run program /var/lib/yarn-ce/bin/container-executor
@zeekling and @K0K0V0K thank you for your feedback on this PR! May I request a review? I've added the unit test and opened the YARN-11715 Jira to improve the change further. |
🎊 +1 overall
This message was automatically generated. |
Thanks @ferdelyi LGTM |
…cannot run program /var/lib/yarn-ce/bin/container-executor (apache#6960)
…cannot run program /var/lib/yarn-ce/bin/container-executor (apache#6960)
…nnot run program /var/lib/yarn-ce/bin/container-executor
Description of PR
How was this patch tested?
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?