Improvements/#427 #438

Merged
merged 3 commits into from
Sep 23, 2024

@nghtm (Collaborator) commented Sep 20, 2024

Issue #, if available: #427

Description of changes: This PR has 3 commits, each addressing a separate issue.

  • Commit 92d1b0c fixes an incorrect config parameter in config.py, a previously undocumented issue.

  • Commit abd677e changes the order in which Docker / Enroot / Pyxis are called in the lifecycle scripts, to reduce the chance of hitting the race condition documented in issue #427.

  • Commit 3c9a655 further addresses #427 by adding a while loop that polls dlami-nvme.service (for at most 120s) for active and execStart messages. This provides assurance that /opt/dlami/nvme is mounted on the node before the enroot configuration, which uses /opt/dlami/nvme, is executed. This commit also reorders the if/elif statement to try /opt/dlami/nvme first, before /opt/sagemaker.
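For illustration, the polling and fallback logic described in the last bullet might look roughly like the sketch below. It is not the code from commit 3c9a655: the function names `wait_for_unit` and `choose_enroot_base` and the 5-second poll interval are assumptions. The sketch checks only whether the unit is active (via `systemctl is-active`) and omits the execStart check the PR mentions.

```shell
#!/bin/sh
# Sketch only. Function names and the 5s poll interval are assumptions,
# not taken from the PR. Only the unit's active state is checked here.

# Poll a systemd unit until it is active or the timeout (seconds) elapses.
# Returns 0 if the unit became active, 1 if the timeout was reached.
wait_for_unit() {
  unit="$1"
  timeout="${2:-120}"   # the PR describes a 120s maximum
  interval="${3:-5}"    # assumed poll interval
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if systemctl is-active --quiet "$unit"; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Print the directory to use for enroot, trying the NVMe mount first and
# falling back to /opt/sagemaker. Returns 1 if neither directory exists.
choose_enroot_base() {
  primary="${1:-/opt/dlami/nvme}"
  fallback="${2:-/opt/sagemaker}"
  if [ -d "$primary" ]; then
    printf '%s\n' "$primary"
  elif [ -d "$fallback" ]; then
    printf '%s\n' "$fallback"
  else
    return 1
  fi
}
```

A caller would run `wait_for_unit dlami-nvme.service 120` first and only then call `choose_enroot_base`, so that the NVMe path is preferred whenever the mount has had time to appear.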

This PR was tested successfully on a HyperPod cluster with 1 p5 and 4 c5.4xlarge instances to verify the intended outcome, and the logs were analyzed to confirm that the while loop functions properly.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…) opportunistically, allowing more time for nvme to mount on clusters which enable observability. Note that docker is installed independently if observability is enabled, and the install_docker script is idempotent (lines 6-9 of install_docker.sh).
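As a rough illustration of what "idempotent" means for an install script, the sketch below skips the installer when the target command is already on the PATH. This is an assumption about the general pattern, not a reproduction of lines 6-9 of install_docker.sh; the helper name `install_once` is hypothetical.

```shell
#!/bin/sh
# Sketch only. install_once is a hypothetical helper, not part of the PR.

# Run the given installer command only if "$1" is not already available.
# Usage: install_once <command-name> <installer> [installer args...]
install_once() {
  cmd="$1"
  shift
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd already installed, skipping"
    return 0
  fi
  "$@"
}
```

With a guard like this, calling the installer from two places (for example, once for observability and once from the main lifecycle flow) is safe, because the second call is a no-op.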
@mhuguesaws (Contributor) left a comment

Needs changes.

…e and execStart messages. This provides assurance that /opt/dlami/nvme is mounted on the node before the enroot configuration, which uses /opt/dlami/nvme, is executed. This commit also reorders the if/elif statement to try /opt/dlami/nvme first, before /opt/sagemaker. For more, see issue #427.

+correction from mhuguesaws comment on original
Signed-off-by: nghtm <[email protected]>

@nghtm (Collaborator, Author) commented Sep 21, 2024

Resolved comment from mhuguesaws.

@mhuguesaws (Contributor) left a comment

LGTM

@nghtm nghtm merged commit 7124b10 into main Sep 23, 2024
@nghtm nghtm deleted the improvements/#427 branch September 23, 2024 02:34
nghtm added a commit that referenced this pull request Sep 23, 2024
Correction to [PR 438](#438) to correctly close the if/elif statement, which was resulting in LCS errors
mhuguesaws pushed a commit that referenced this pull request Sep 23, 2024
Correction to [PR 438](#438) to correctly close the if/elif statement, which was resulting in LCS errors