Add NVIDIA GPU support across Docker/Slurm + example job #68
Conversation
giovtorres left a comment:
Thank you so much for this feature! I took a first pass and it looks good. I do have some questions and some requested changes.
On the partition definitions:

```
PartitionName=normal Nodes=c1,c2 Default=YES MaxTime=INFINITE State=UP
PartitionName=cpu Nodes=c1,c2 Default=YES MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=g1 Default=No MaxTime=INFINITE State=UP
```

Suggested change:

```diff
-PartitionName=gpu Nodes=g1 Default=No MaxTime=INFINITE State=UP
+PartitionName=gpu Nodes=g1 Default=NO MaxTime=INFINITE State=UP
```
On the example job script:

```bash
# #SBATCH -w g1

# Make CUDA visible to Slurm-launched shells (they don't inherit container ENV)
export CUDA_HOME=/usr/local/cuda-12.6
```

I think we should allow flexibility here:

```diff
-export CUDA_HOME=/usr/local/cuda-12.6
+export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-12.6}"
```

Something like this. Thoughts?
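As a side note, a quick sketch of how that parameter expansion behaves (the `PATH` line and the `echo` are illustrative additions, not part of the diff):

```bash
# ${CUDA_HOME:-/usr/local/cuda-12.6} keeps a caller-provided CUDA_HOME
# and only falls back to the pinned default when it is unset or empty.
export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-12.6}"
export PATH="${CUDA_HOME}/bin:${PATH}"
echo "Using CUDA from: ${CUDA_HOME}"
```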
On `docker-compose.yml`:

```yaml
dockerfile: Dockerfile.gpu
args:
  SLURM_VERSION: ${SLURM_VERSION:-25.05.3}
# runtime: nvidia
```

Can this comment be removed?
```yaml
# privileged: true
# gpus: "all"
# gpus:
#   - device: "0"  # Expose only GPU 0
```

Are these comments needed?
On `Dockerfile.gpu`:

```dockerfile
# Multi-stage Dockerfile for Slurm runtime with NVIDIA GPU support
# Stage 1: Build RPMs using the builder image
# Stage 2: Install RPMs in a clean runtime image with NVIDIA support

ARG SLURM_VERSION

# ============================================================================
# Stage 1: Build RPMs
# ============================================================================
FROM rockylinux/rockylinux:9 AS builder

ARG SLURM_VERSION

# Enable CRB and EPEL repositories for development packages
RUN set -ex \
    && dnf makecache \
    && dnf -y update \
    && dnf -y install dnf-plugins-core epel-release \
    && dnf config-manager --set-enabled crb \
    && dnf makecache

# Install RPM build tools and dependencies
RUN set -ex \
    && dnf -y install \
        autoconf \
        automake \
        bzip2 \
        freeipmi-devel \
        dbus-devel \
        gcc \
        gcc-c++ \
        git \
        gtk2-devel \
        hdf5-devel \
        http-parser-devel \
        hwloc-devel \
        json-c-devel \
        libcurl-devel \
        libyaml-devel \
        lua-devel \
        lz4-devel \
        make \
        man2html \
        mariadb-devel \
        munge \
        munge-devel \
        ncurses-devel \
        numactl-devel \
        openssl-devel \
        pam-devel \
        perl \
        python3 \
        python3-devel \
        readline-devel \
        rpm-build \
        rpmdevtools \
        rrdtool-devel \
        wget \
    && dnf clean all \
    && rm -rf /var/cache/dnf

# Setup RPM build environment
RUN rpmdev-setuptree

# Copy RPM macros
COPY rpmbuild/slurm.rpmmacros /root/.rpmmacros

# Download official Slurm release tarball and build RPMs with slurmrestd enabled
RUN set -ex \
    && wget -O /root/rpmbuild/SOURCES/slurm-${SLURM_VERSION}.tar.bz2 \
        https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2 \
    && cd /root/rpmbuild/SOURCES \
    && rpmbuild -ta slurm-${SLURM_VERSION}.tar.bz2 \
    && ls -lh /root/rpmbuild/RPMS/x86_64/

# ============================================================================
# Stage 2: Runtime image with NVIDIA GPU support
# ============================================================================
FROM rockylinux/rockylinux:9

LABEL org.opencontainers.image.source="https://github.com/giovtorres/slurm-docker-cluster" \
      org.opencontainers.image.title="slurm-docker-cluster-gpu" \
      org.opencontainers.image.description="Slurm Docker cluster on Rocky Linux 9 with NVIDIA GPU support" \
      maintainer="Giovanni Torres"

ARG SLURM_VERSION

# Enable CRB and EPEL repositories for runtime dependencies
RUN set -ex \
    && dnf makecache \
    && dnf -y update \
    && dnf -y install dnf-plugins-core epel-release \
    && dnf config-manager --set-enabled crb \
    && dnf makecache

# Install runtime dependencies only
RUN set -ex \
    && dnf -y install \
        bash-completion \
        bzip2 \
        gettext \
        hdf5 \
        http-parser \
        hwloc \
        json-c \
        jq \
        libaec \
        libyaml \
        lua \
        lz4 \
        mariadb \
        munge \
        numactl \
        perl \
        procps-ng \
        psmisc \
        python3 \
        readline \
        vim-enhanced \
        wget \
    && dnf clean all \
    && rm -rf /var/cache/dnf
```
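For a one-off build outside Compose, something like the following should work (the tag name is illustrative):

```bash
docker build -f Dockerfile.gpu \
    --build-arg SLURM_VERSION=25.05.3 \
    -t slurm-docker-cluster-gpu:25.05.3 .
```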
Could we avoid some duplication by extending the base image? For example:

```dockerfile
FROM slurm-docker-cluster:${SLURM_VERSION}
RUN dnf config-manager --add-repo ...
# [... rest of GPU-only changes ...]
```
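A fuller sketch of that idea, assuming the CPU base image is published as `slurm-docker-cluster:${SLURM_VERSION}` and reusing the repo URLs already in this diff (the `CUDA_VERSION` arg follows the suggestion made below):

```dockerfile
ARG SLURM_VERSION
FROM slurm-docker-cluster:${SLURM_VERSION}

ARG CUDA_VERSION=12.6

# GPU-only additions layered on the CPU base image
RUN set -ex \
    && dnf -y install dnf-plugins-core \
    && dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
        -o /etc/yum.repos.d/nvidia-container-toolkit.repo \
    && dnf -y install nvidia-container-toolkit cuda-toolkit-${CUDA_VERSION//./-} \
    && dnf clean all

ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
```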
| && dnf -y install \ | ||
| nvidia-container-toolkit \ | ||
| cuda-toolkit-12-6 \ |
This should be a Docker ARG, e.g. `ARG CUDA_VERSION=12.6`, and then:

```dockerfile
RUN dnf install -y cuda-toolkit-${CUDA_VERSION//./-}
ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
```

On the repository setup lines (from the same `RUN`):

```
    && dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
        | tee /etc/yum.repos.d/nvidia-container-toolkit.repo \
    && sed -i -e 's/^gpgcheck=1/gpgcheck=0/' -e 's/^repo_gpgcheck=1/repo_gpgcheck=0/' /etc/yum.repos.d/nvidia-container-toolkit.repo \
```
Why disable the gpgcheck? There should be an RPM GPG key that can be imported, no?
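One possible alternative that keeps gpgcheck enabled (a sketch; the `gpgkey` URL is the one referenced by NVIDIA's published repo file and is an assumption to verify):

```dockerfile
# Sketch: import NVIDIA's signing key instead of disabling gpgcheck.
# Assumption: the key URL below matches what the .repo file references.
RUN curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
        -o /etc/yum.repos.d/nvidia-container-toolkit.repo \
    && rpm --import https://nvidia.github.io/libnvidia-container/gpgkey
```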
On the node and partition definitions:

```
NodeName=g1 CPUs=4 RealMemory=1000 Gres=gpu:1 State=UNKNOWN

PartitionName=normal Nodes=c1,c2 Default=YES MaxTime=INFINITE State=UP
PartitionName=cpu Nodes=c1,c2 Default=YES MaxTime=INFINITE State=UP
```

Changing the partition name is a breaking change. Was this intentional?
Thank you @giovtorres for your suggestions. I will work on it and adapt this PR.
Add NVIDIA GPU Support to Docker + Slurm Cluster
Summary
This PR adds native GPU support to `slurm-docker-cluster`, enabling CUDA workloads within Slurm jobs via NVIDIA Container Toolkit integration. It introduces a GPU-enabled worker (`g1`), GRES GPU configuration, and an example job to verify isolation and visibility of GPU resources through Slurm.

Motivation
Until now, `slurm-docker-cluster` supported only CPU-based nodes. This update introduces GPU scheduling capabilities to enable testing and development of CUDA workloads within a fully containerized Slurm environment, useful for AI/ML pipelines, HPC prototyping, and CI validation of GPU-enabled jobs.
Key Features
🧩 Docker / Image Updates
- `Dockerfile.gpu`: a CUDA 12.6-based image built with Slurm (`25.05`)
- Bundles `cgroup.conf` and `gres.conf`
- Sets `NVIDIA_VISIBLE_DEVICES=0`
- Sets `CUDA_HOME=/usr/local/cuda`

🐳 Docker Compose
- New `g1` service (image: `slurm-docker-cluster-gpu`)
- Sets `NVIDIA_VISIBLE_DEVICES: 0` and a device reservation for GPU 0
- Mounts `/sys/fs/cgroup` for Slurm's cgroup access

⚙️ Slurm Configuration
- `slurm.conf`: adds the GPU node (`g1`) and the `gpu` partition
- `config/common/gres.conf`: defines the GPU GRES mapping (see the sketch below)
- `cgroup.conf` and `gres.conf` are gracefully skipped if not present
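For reference, a minimal `gres.conf` consistent with the single-GPU mapping shown later in this PR (the exact file contents are an assumption):

```
# gres.conf: map node g1's gpu GRES to the first NVIDIA device
NodeName=g1 Name=gpu File=/dev/nvidia0
```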
🔁 Helper Script

`update_slurmfiles.sh` now syncs `gres.conf` into containers and restarts services cleanly.

Verification
Quick Start
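The original Quick Start block did not survive formatting; a plausible sequence, assuming the repository's existing `make` targets (see "After `make up`" below):

```bash
make up                          # build the images (including the GPU image) and start the cluster
docker exec slurmctld sinfo      # confirm the gpu partition and node g1 are idle
```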
Compatibility
- Slurm `24.11` and `25.05` (auto-detected at build time)
- `docker compose v2.40+` and `Docker Engine 28+`
- `nvidia-container-toolkit` installed on the host

Cluster Status
After `make up`:

```bash
docker exec slurmctld sinfo
```

Output:

```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu*         up   infinite      2   idle c[1-2]
gpu          up   infinite      1   idle g1
```

GPU Information Summary
The `g1` container exposes only the device selected by `NVIDIA_VISIBLE_DEVICES=0`, and that visibility carries through to `srun`-launched job steps.

GPU Test Command
Run a quick validation:

```bash
docker exec -it slurmctld bash -lc 'srun -p gpu --gres=gpu:1 nvidia-smi -L'
```

Expected output:

```
GPU 0: NVIDIA GeForce RTX 2070 (UUID: GPU-b7862af4-0f16-56bf-0d89-a36ead3f3f2f)
```

This confirms that Slurm correctly schedules on the GPU node and inherits container GPU visibility.
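For a batch variant, a sketch of a GPU job script consistent with the snippets reviewed above (only the commented `-w g1` pin and the `CUDA_HOME` export appear in the actual diff; everything else is illustrative):

```bash
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:1
# #SBATCH -w g1   # optionally pin the job to the GPU node

# Make CUDA visible to Slurm-launched shells (they don't inherit container ENV)
export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-12.6}"
export PATH="${CUDA_HOME}/bin:${PATH}"

nvidia-smi -L
```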
Example: GPU Isolation and Scalability
The host system has 2 physical GPUs, but only one (GPU 0) is mapped into the `g1` container. This demonstrates per-container GPU isolation. To expose both GPUs instead:

- set `NVIDIA_VISIBLE_DEVICES=all`
- extend `gres.conf` to `File=/dev/nvidia[0-1]`
- change `device_ids: ['0']` to `gpus: "all"` in `docker-compose.yml` (see the Compose sketch below)

This flexibility allows both multi-GPU single-node and multi-node GPU setups.
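As a sketch of the Compose side of that switch (the surrounding keys follow standard Compose device-reservation syntax; only `device_ids: ['0']` and `gpus: "all"` are quoted from this PR):

```yaml
services:
  g1:
    # Single-GPU mode (this PR): expose only GPU 0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    # Multi-GPU mode: drop the reservation above and use
    # gpus: "all"
```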
Runtime Evidence
```
===== GPU Device Info on host =====
GPU 0: NVIDIA GeForce RTX 2070 (UUID: GPU-b7862af4-0f16-56bf-0d89-a36ead3f3f2f)
GPU 1: NVIDIA GeForce RTX 2070 (UUID: GPU-22dfd02e-a668-a6a6-a90a-39d6efe475ee)

===== GPU Device Info inside g1 =====
NVDEV=0
/dev/nvidia0
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
GPU 0: NVIDIA GeForce RTX 2070 (UUID: GPU-b7862af4-0f16-56bf-0d89-a36ead3f3f2f)

===== GPU check via srun in Slurm =====
GPU 0: NVIDIA GeForce RTX 2070 (UUID: GPU-b7862af4-0f16-56bf-0d89-a36ead3f3f2f)
```

The output validates the intended GPU mapping and Slurm GRES behavior.
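A sketch of a script that would produce output of this shape (entirely illustrative; the actual verification script is not included in this excerpt):

```bash
#!/usr/bin/env bash
echo "===== GPU Device Info on host ====="
nvidia-smi -L

echo "===== GPU Device Info inside g1 ====="
docker exec g1 bash -lc 'echo "NVDEV=${NVIDIA_VISIBLE_DEVICES}"; ls /dev/nvidia*; nvidia-smi -L'

echo "===== GPU check via srun in Slurm ====="
docker exec slurmctld bash -lc 'srun -p gpu --gres=gpu:1 nvidia-smi -L'
```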
Why This Matters
Known Limitations
- Single GPU node (`g1`); multi-node GPU setups require adding additional nodes (`g2`, `g3`, etc.)
- `nvcc` is available in the container, but CUDA samples are not included.

Implementation Summary
Checklist
Follow-ups
- Multi-node GPU deployment support via Docker Swarm:
  - initialize Swarm (`docker swarm init`) on the manager
  - join workers with `swarm join`
  - label GPU nodes with `docker node update --label-add gpu=<id>`
  - use an overlay network for inter-host communication
  - extend `slurm.conf` and `gres.conf` to include `g2` and additional GPU nodes (a config sketch follows)
- Overlay networks without Swarm, using tools like Weave Net or Flannel
- Shared state via NFS-mounted volumes or a distributed volume driver

Example checks for a Swarm deployment:

```bash
docker service ps slurm-cluster_g1
docker service ps slurm-cluster_g2
docker exec slurmctld sinfo
```
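For that multi-node follow-up, the config extension would presumably mirror `g1` (a sketch; `g2`'s CPU/memory specs and device path are assumptions):

```
# slurm.conf: add the second GPU node and extend the gpu partition
NodeName=g2 CPUs=4 RealMemory=1000 Gres=gpu:1 State=UNKNOWN
PartitionName=gpu Nodes=g1,g2 Default=NO MaxTime=INFINITE State=UP

# gres.conf: map g2's gpu GRES to its local device
NodeName=g2 Name=gpu File=/dev/nvidia0
```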