HyperPod Default Enroot Path is Root Volume #427

Closed
@nghtm

Description

The HyperPod default enroot path uses /opt/sagemaker due to the first if statement defined here. This is usually about 500 GB of root volume, depending on user configuration.

Larger models, including Nemotron 340B, need a larger volume to avoid running out of enroot space, as seen in the error log below:

slurmstepd: error: pyxis: child 1528235 failed with error code: 1
slurmstepd: error: pyxis: failed to create container filesystem
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     [INFO] Extracting squashfs filesystem...
slurmstepd: error: pyxis:     Write on output file failed because No space left on device
slurmstepd: error: pyxis:     FATAL ERROR:writer: failed to write file /opt/sagemaker/tmp/enroot/data/user-1000/pyxis_167.0/usr/local/tensorrt/targets/x86_64-linux-gnu/lib/libnvinfer_static.a
slurmstepd: error: pyxis:     Parallel unsquashfs: Using 96 processors
slurmstepd: error: pyxis:     433207 inodes (820890 blocks) to write
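
To confirm that the volume behind the enroot path is the one filling up, its free space can be checked with `df`. This is a minimal sketch: `avail_kib` is a hypothetical helper name, not part of the lifecycle scripts, and it assumes the filesystem name contains no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the lifecycle scripts): print the space
# available, in KiB, on the filesystem that holds the given path.
avail_kib() {
  # -P forces the POSIX one-line-per-filesystem format; -k reports 1024-byte blocks.
  # Column 4 of the second line is the "Available" field.
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Usage on a cluster node (path taken from the error log):
# avail_kib /opt/sagemaker
```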

This can be fixed by reordering the if/elif statement so that it defaults to /opt/dlami/nvme (28 TB on p5 instances) instead, which makes enroot use NVMe rather than root volume space here. A PR is required to change the order of the if/elif branches in the lifecycle script.
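
The proposed reordering can be sketched as follows. This is not the actual lifecycle script, only an illustration under assumptions: `pick_enroot_base` is a hypothetical helper, and the only details taken from this issue are the two paths. The point is that whichever directory is checked first wins, so the NVMe mount has to come before the root volume path.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the actual lifecycle script: print the first
# candidate directory that exists, so the order of the arguments decides
# which volume is preferred.
pick_enroot_base() {
  local dir
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1  # none of the candidates exists
}

# Usage: list the NVMe mount first so it is preferred over the root volume.
# pick_enroot_base /opt/dlami/nvme /opt/sagemaker
```

On a node where both directories exist, this ordering would select /opt/dlami/nvme; the root volume path is used only as a fallback when the NVMe mount is absent.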

Labels: bug (Something isn't working)
