HyperPod Default Enroot Path is Root Volume #427
Comments
When run on a newly deployed HyperPod cluster (4x p5s), I see the following, which indicates that the enroot path is indeed being set to
As a temporary workaround, we are considering two options.
1. Export environment variables. According to the enroot docs, environment variables override the configuration file. If we go this route, note that the paths must first be created on the nodes separately:
Then export env vars:
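A minimal sketch of option 1, assuming the enroot directories live under /opt/dlami/nvme/enroot. The function name `setup_enroot_env` and the `runtime`, `cache`, and `data` subdirectory names are illustrative choices, not taken from the lifecycle script; the `ENROOT_*` variable names are the ones enroot reads.

```shell
# Sketch only: creates the enroot directories under a base path and
# exports the matching ENROOT_* variables for the current shell.
# The subdirectory layout (runtime/cache/data) is an assumption.
setup_enroot_env() {
  local base="$1"

  # The directories must exist before enroot is pointed at them.
  mkdir -p "${base}/runtime" "${base}/cache" "${base}/data"

  # Environment variables take precedence over enroot.conf.
  export ENROOT_RUNTIME_PATH="${base}/runtime"
  export ENROOT_CACHE_PATH="${base}/cache"
  export ENROOT_DATA_PATH="${base}/data"
}

# Example call on a node (may need sudo to create the directories):
#   setup_enroot_env /opt/dlami/nvme/enroot
```

The exports only affect the shell that runs them, so they would have to be repeated in every job environment that launches enroot.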
Note that the lifecycle script must be changed afterwards.
2. Use a script to modify enroot.conf in place. The script creates new directories and sets the enroot paths under /opt/dlami/nvme/enroot, where we have 28 TB of SSD compared to 500 GB at /opt/sagemaker/. Create a file called update-enroot.sh:
Apply the script via ansible or srun (requires sudo):
Run a sanity check to confirm that enroot.conf is updated:
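A sketch of what option 2 could look like, assuming enroot.conf uses `KEY value` lines in which unset keys are commented out with a leading `#`. The helper names `set_enroot_key` and `update_enroot_conf`, the subdirectory names, and the /etc/enroot/enroot.conf location in the usage comment are assumptions, and `sed -i` here relies on GNU sed.

```shell
# set_enroot_key CONF KEY VALUE
# Replaces an existing (possibly commented-out) KEY line, or appends one.
set_enroot_key() {
  local conf="$1" key="$2" value="$3"
  if grep -Eq "^#?${key}[[:space:]]" "$conf"; then
    # '|' is the sed delimiter because the value is a path containing '/'.
    sed -i -E "s|^#?${key}[[:space:]].*|${key} ${value}|" "$conf"
  else
    printf '%s %s\n' "$key" "$value" >> "$conf"
  fi
}

# update_enroot_conf CONF BASE
# Creates the directories under BASE and points enroot.conf at them.
update_enroot_conf() {
  local conf="$1" base="$2"
  mkdir -p "${base}/runtime" "${base}/cache" "${base}/data"
  set_enroot_key "$conf" ENROOT_RUNTIME_PATH "${base}/runtime"
  set_enroot_key "$conf" ENROOT_CACHE_PATH "${base}/cache"
  set_enroot_key "$conf" ENROOT_DATA_PATH "${base}/data"
}

# Example call on a node (requires sudo; the conf path is an assumption):
#   update_enroot_conf /etc/enroot/enroot.conf /opt/dlami/nvme/enroot
# Sanity check afterwards:
#   grep -E '^ENROOT_(RUNTIME|CACHE|DATA)_PATH' /etc/enroot/enroot.conf
```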
Another report of Enroot path being Above script was provided, but long-term fix to
The simplest fix would be to switch the order of this if statement, to use
Will prioritize.
* fix incorrect config param for update_neuron_sdk LCS
* move Docker/Enroot/Pyxis installation after Observability (if enabled) opportunistically, allowing more time for NVMe to mount on clusters which enable observability. Note that docker is installed independently if observability is enabled, and the install_docker script is idempotent (lines 6-9 of install_docker.sh)
* add a while loop that polls (max 120s) dlami-nvme.service for active and execStart messages. This provides assurance that /opt/dlami/nvme is mounted on the node before the enroot configuration, which uses /opt/dlami/nvme, is executed. This commit also updates the order of the if/elif statement to first try /opt/dlami/nvme before /opt/sagemaker. For more, see issue #427
+correction from mhugueaws comment on original
Signed-off-by: nghtm <[email protected]>
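The polling step described in the commit message can be sketched roughly as follows. This is not the code from the commit: `wait_until` is a hypothetical helper, the 5-second interval is an assumption, and the sketch only checks whether the unit is active rather than inspecting its execStart messages.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or roughly TIMEOUT_SECONDS have elapsed.
# Returns 0 on success and 1 on timeout.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Example on a node: wait up to 120s for the NVMe mount service.
#   if wait_until 120 5 systemctl is-active --quiet dlami-nvme.service; then
#     echo "dlami-nvme.service is active; configuring enroot on NVMe"
#   else
#     echo "timed out waiting for dlami-nvme.service" >&2
#   fi
```

Passing the check as a command keeps the loop independent of systemd, so the same helper could poll for a mount point or a directory instead.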
The HyperPod default enroot path uses /opt/sagemaker due to the first if statement in the lifecycle script. This is usually approximately 500 GB of root volume, depending on user configuration. For larger models, including Nemotron 340b, a larger volume is required to avoid running out of enroot space, as seen in the error log below:
This can be fixed by changing the order of the if/elif statement to default to /opt/dlami/nvme (28 TB on p5s) instead, which will make enroot use NVMe instead of root volume space. A PR is required to modify the order of the if/elif in the lifecycle script.
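The proposed ordering can be illustrated with a small sketch. It is not the lifecycle script itself: `pick_enroot_base` is a hypothetical function, and returning an error when neither directory exists is an assumption about the fallback behavior.

```shell
# pick_enroot_base [PREFERRED] [FALLBACK]
# Prints the first existing directory, trying NVMe before the root volume.
pick_enroot_base() {
  local nvme="${1:-/opt/dlami/nvme}" root="${2:-/opt/sagemaker}"
  if [ -d "$nvme" ]; then
    # Preferred: instance-store NVMe (28 TB on p5s).
    echo "$nvme"
  elif [ -d "$root" ]; then
    # Fallback: root volume (roughly 500 GB).
    echo "$root"
  else
    return 1
  fi
}

# Example on a node:
#   base="$(pick_enroot_base)" && echo "enroot will use ${base}"
```

With this order, /opt/sagemaker is only chosen on instance types that have no NVMe mount at /opt/dlami/nvme.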