-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements/#427 #438
Merged
Merged
Improvements/#427 #438
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…) opertunisitically, allowing more time for nvme to mount on clusters which enable observability. note docker is installed independently if observability is enabled, and install_docker script is idempotent (line 6-9 install_docker.sh)
mhuguesaws
reviewed
Sep 20, 2024
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_enroot_pyxis.sh
Show resolved
Hide resolved
mhuguesaws
requested changes
Sep 20, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need changes.
…e and execStart messages. This provides assurance that /opt/dlami/nvme is mounted to node prior to executing enroot configuration which will use /opt/dlami/nvme. This commit also updates the order of if /elif statement to first try /opt/dlami/nvme before /opt/sagemaker. For more, see issue #427 #427 +correction from mhugueaws comment on original Signed-off-by: nghtm <[email protected]>
nghtm
force-pushed
the
improvements/#427
branch
from
September 21, 2024 00:02
3c9a655
to
cec6633
Compare
resolved comment from mhugueaws |
mhuguesaws
approved these changes
Sep 21, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
nghtm
added a commit
that referenced
this pull request
Sep 23, 2024
Correction to [PR 438](#438) to correctly close the if/elif statement, which was resulting in LCS errors
mhuguesaws
pushed a commit
that referenced
this pull request
Sep 23, 2024
Correction to [PR 438](#438) to correctly close the if/elif statement, which was resulting in LCS errors
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Issue #, if available: #427
Description of changes: This PR has 3 commits, each addressing separate issues.
Commit 92d1b0c to fix incorrect config param in config.py, previously undocumented issue.
Commit abd677e to modify order which Docker / Enroot / Pyxis is called in lifecycle scripts, to mitigate chance of encountering race condition documented in issue 427
Commit 3c9a655 to further address 427 by adding a while loop that will poll (max 120s) dlami-nvme.service for active and execStart messages. This provides assurance that /opt/dlami/nvme is mounted to node prior to executing enroot configuration which will use /opt/dlami/nvme. This commit also updates the order of if/elif statement to first try /opt/dlami/nvme before /opt/sagemaker.
This PR has been tested successfully on a HyperPod cluster with 1 p5 and 4 c5.4xlarge to verify intended outcome and logs analyzed to confirm while loop functioning properly.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.