Closed
Description
The HyperPod default enroot path uses /opt/sagemaker
because of the first if statement defined here. This is usually about 500 GB of root volume, depending on user configuration.
Larger models, including Nemotron 340B, need a larger volume to avoid running out of enroot space, as seen in the error log below:
slurmstepd: error: pyxis: child 1528235 failed with error code: 1
slurmstepd: error: pyxis: failed to create container filesystem
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [INFO] Extracting squashfs filesystem...
slurmstepd: error: pyxis: Write on output file failed because No space left on device
slurmstepd: error: pyxis: FATAL ERROR:writer: failed to write file /opt/sagemaker/tmp/enroot/data/user-1000/pyxis_167.0/usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer_static.a
slurmstepd: error: pyxis: Parallel unsquashfs: Using 96 processors
slurmstepd: error: pyxis: 433207 inodes (820890 blocks) to write
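To confirm that the volume behind the enroot path is the one filling up, free space can be checked on the node before launching the job. The following is a minimal sketch only: the helper names `avail_kb` and `has_at_least_gb` are hypothetical and not part of the lifecycle scripts, and it assumes a POSIX `df` and `awk` are available.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the HyperPod lifecycle scripts).

# Print the available space, in 1 KiB blocks, of the filesystem holding $1.
avail_kb() {
    # -P forces the one-line-per-filesystem POSIX format; field 4 is "Available".
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeed if the filesystem holding $1 has at least $2 GiB free.
has_at_least_gb() {
    local path="$1" min_gb="$2"
    local kb
    kb="$(avail_kb "$path")" || return 1
    [ "$kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# Example (paths taken from this issue; the threshold is illustrative):
#   has_at_least_gb /opt/sagemaker 500 || echo "root volume is low on space"
```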
This can be fixed by changing the order of the if/elif statement so that it defaults to /opt/dlami/nvme
(28 TB on p5 instances) instead, which makes enroot use NVMe rather than root volume space here. A PR is required to reorder the if/elif in the lifecycle script.
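The proposed reordering could look roughly like the sketch below. It is not the lifecycle script's actual code: the function name `choose_enroot_base` is hypothetical, and testing for the directory with `-d` is an assumption about how the script detects each location. Only the two paths come from this issue.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the reordered check: prefer NVMe, fall back to root volume.
# Arguments (both optional, defaulting to the paths named in this issue):
#   $1  NVMe directory to prefer   (default /opt/dlami/nvme)
#   $2  root-volume fallback       (default /opt/sagemaker)
choose_enroot_base() {
    local nvme_dir="${1:-/opt/dlami/nvme}"
    local root_dir="${2:-/opt/sagemaker}"

    if [[ -d "$nvme_dir" ]]; then
        # NVMe is checked first, so it wins whenever it is present.
        echo "$nvme_dir"
    elif [[ -d "$root_dir" ]]; then
        # Root-volume location is used only when NVMe is absent.
        echo "$root_dir"
    else
        echo "no enroot base directory found" >&2
        return 1
    fi
}

# Example: ENROOT_BASE="$(choose_enroot_base)" || exit 1
```

With this ordering, instances that have both directories pick the NVMe one, while instances without NVMe keep the current behaviour.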