
Conversation

@andrewssobral

Add NVIDIA GPU Support to Docker + Slurm Cluster

✅ Verified on Docker Engine 28.5.1 + Slurm 25.05 with NVIDIA RTX 2070 GPUs

Summary

This PR adds native GPU support to slurm-docker-cluster, enabling CUDA workloads within Slurm jobs via NVIDIA Container Toolkit integration.

It introduces a GPU-enabled worker (g1), GRES GPU configuration, and an example job to verify isolation and visibility of GPU resources through Slurm.

Motivation

Until now, slurm-docker-cluster supported only CPU-based nodes.
This update introduces GPU scheduling capabilities to enable testing and development of CUDA workloads within a fully containerized Slurm environment — useful for AI/ML pipelines, HPC prototyping, and CI validation of GPU-enabled jobs.


Key Features

🧩 Docker / Image Updates

  • Added Dockerfile.gpu, a CUDA 12.6-based image built with:
    • NVIDIA Container Toolkit
    • Slurm configuration copied dynamically based on the Slurm version (e.g. 25.05)
    • Optional installation of cgroup.conf and gres.conf
  • Sets environment variables to restrict and identify the visible GPUs (see the sketch after this list):
    • NVIDIA_VISIBLE_DEVICES=0
    • CUDA_HOME=/usr/local/cuda
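
A minimal sketch of the GPU-related pieces of Dockerfile.gpu (the config path and default version below are illustrative assumptions, not the PR's exact layout):

ARG SLURM_VERSION=25.05.3
# Hypothetical: config directory keyed by Slurm release
ARG SLURM_CONFIG_DIR=25.05

# Restrict and identify the GPUs visible inside the container
ENV NVIDIA_VISIBLE_DEVICES=0 \
    CUDA_HOME=/usr/local/cuda

# Copy the Slurm configuration matching the requested version
COPY config/${SLURM_CONFIG_DIR}/slurm.conf /etc/slurm/slurm.conf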

🐳 Docker Compose

  • Added a new GPU node service, g1 (sketched below), which:
    • Uses the CUDA image (slurm-docker-cluster-gpu)
    • Declares NVIDIA_VISIBLE_DEVICES: 0 and a device reservation for GPU 0
    • Mounts /sys/fs/cgroup for Slurm’s cgroup access
    • Joins the same slurm-network for full cluster interoperability
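
A sketch of what the g1 service could look like in docker-compose.yml (values mirror the bullets above; the exact fields in the PR may differ):

  g1:
    image: slurm-docker-cluster-gpu:${SLURM_VERSION:-25.05.3}
    hostname: g1
    environment:
      NVIDIA_VISIBLE_DEVICES: "0"
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    networks:
      - slurm-network
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]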

⚙️ Slurm Configuration

  • Updated slurm.conf:

    GresTypes=gpu
    NodeName=g1 CPUs=2 RealMemory=2000 Gres=gpu:1
    PartitionName=cpu Nodes=c[1-2] Default=YES MaxTime=INFINITE State=UP
    PartitionName=gpu Nodes=g1 Default=NO MaxTime=INFINITE State=UP

  • Added a minimal config/common/gres.conf:

    NodeName=g1 Name=gpu File=/dev/nvidia0
    # To enable two (or more) GPUs:
    # NodeName=g1 Name=gpu File=/dev/nvidia[0-1]

  • Made cgroup.conf and gres.conf optional: they are gracefully skipped if not present (see the sketch after this list).
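
The optional copy could be as simple as the loop below, run after the config directory has been copied into the image (a sketch; the PR's actual mechanism and paths may differ):

# Copy cgroup.conf / gres.conf into place only if they were provided
for f in cgroup.conf gres.conf; do
    if [ -f "config/common/$f" ]; then
        cp "config/common/$f" "/etc/slurm/$f"
    else
        echo "Skipping optional $f (not present)"
    fi
done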

🔁 Helper Script

  • update_slurmfiles.sh now syncs gres.conf into the running containers and restarts services cleanly (sketched below).
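
A rough sketch of that sync flow (container names follow the cluster defaults; the real script may differ):

# Push the updated config into the running containers, then reload Slurm
for node in slurmctld c1 c2 g1; do
    docker cp config/common/gres.conf "$node":/etc/slurm/gres.conf
done
docker exec slurmctld scontrol reconfigure
# GRES changes may need a full daemon restart instead:
# docker compose restart slurmctld c1 c2 g1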

Verification

Quick Start

# 1. Start the cluster
make up

# 2. Check node and partition info
docker exec slurmctld sinfo

# 3. Run a GPU job
docker exec -it slurmctld bash -lc 'srun -p gpu --gres=gpu:1 nvidia-smi -L'

Compatibility

  • Works with Slurm versions 24.11 and 25.05 (auto-detected at build time)
  • Compatible with both docker compose v2.40+ and Docker Engine 28+
  • Requires NVIDIA drivers and nvidia-container-toolkit installed on the host
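
To sanity-check those host prerequisites before bringing the cluster up, something along these lines should work (the CUDA image tag is only an example):

# Driver visible on the host?
nvidia-smi -L

# NVIDIA runtime registered with Docker (shows up under "Runtimes" when configured)?
docker info | grep -i runtime

# End-to-end toolkit check
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi -L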

Cluster Status

After make up:

docker exec slurmctld sinfo

Output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu*         up   infinite      2   idle c[1-2]
gpu          up   infinite      1   idle g1

GPU Information Summary

Source         GPUs Visible             Notes
Host           2 GPUs (GPU 0, GPU 1)    Both RTX 2070s detected
Container g1   1 GPU (GPU 0)            Isolated via NVIDIA_VISIBLE_DEVICES=0
Slurm (srun)   1 GPU (GPU 0)            Matches container visibility (verified via srun)

GPU Test Command

Run a quick validation:

docker exec -it slurmctld bash -lc 'srun -p gpu --gres=gpu:1 nvidia-smi -L'

Expected output:

GPU 0: NVIDIA GeForce RTX 2070 (UUID: GPU-b7862af4-0f16-56bf-0d89-a36ead3f3f2f)

This confirms that Slurm correctly schedules on the GPU node and inherits container GPU visibility.
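
The PR's example job is not reproduced here, but a minimal batch script along these lines exercises the same path (file name and contents are illustrative):

#!/bin/bash
#SBATCH --job-name=gpu-check
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --output=gpu-check-%j.out

# Slurm-launched shells don't inherit the container ENV, so set CUDA_HOME explicitly
export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"

nvidia-smi -L

Submit it from the shared job directory (mounted at /data in the base cluster setup):

docker exec slurmctld bash -lc 'cd /data && sbatch gpu-check.sbatch'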


Example: GPU Isolation and Scalability

The host system has 2 physical GPUs, but only one (GPU 0) is mapped into the g1 container.
This demonstrates that:

  • GPU resources can be isolated per compute node (e.g., g1 → GPU 0, g2 → GPU 1)
  • Or multiple GPUs can be attached to a single node (see the sketch below) by changing:
    • NVIDIA_VISIBLE_DEVICES=0 to NVIDIA_VISIBLE_DEVICES=all
    • gres.conf to File=/dev/nvidia[0-1]
    • device_ids: ['0'] to gpus: "all" in docker-compose.yml

This flexibility allows both multi-GPU single-node and multi-node GPU setups.
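
For the single-node, two-GPU variant, changes along these lines would be enough (a sketch assuming the two-GPU host shown in the runtime evidence):

# gres.conf
NodeName=g1 Name=gpu File=/dev/nvidia[0-1]

# slurm.conf
NodeName=g1 CPUs=2 RealMemory=2000 Gres=gpu:2

# docker-compose.yml (g1 service)
environment:
  NVIDIA_VISIBLE_DEVICES: "all"
gpus: "all"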


Runtime Evidence

===== GPU Device Info on host =====
GPU 0: NVIDIA GeForce RTX 2070 (UUID: GPU-b7862af4-0f16-56bf-0d89-a36ead3f3f2f)
GPU 1: NVIDIA GeForce RTX 2070 (UUID: GPU-22dfd02e-a668-a6a6-a90a-39d6efe475ee)

===== GPU Device Info inside g1 =====
NVDEV=0
/dev/nvidia0
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
GPU 0: NVIDIA GeForce RTX 2070 (UUID: GPU-b7862af4-0f16-56bf-0d89-a36ead3f3f2f)

===== GPU check via srun in Slurm =====
GPU 0: NVIDIA GeForce RTX 2070 (UUID: GPU-b7862af4-0f16-56bf-0d89-a36ead3f3f2f)

The output validates the intended GPU mapping and Slurm GRES behavior.


Why This Matters

  • Provides a reproducible GPU-enabled Slurm environment for local testing, CI, or hybrid setups.
  • Demonstrates resource isolation between host, container, and scheduler.
  • Serves as a base for future multi-GPU or heterogeneous cluster support.

Known Limitations

  • Currently supports a single GPU node (g1); multi-node GPU setups require adding g2, g3, etc.
  • GPU accounting is basic; Slurm GRES tracking is active but cgroup GPU enforcement is disabled for simplicity.
  • nvcc is available in the container, but CUDA samples are not included.
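
Since the CUDA samples are not bundled, a quick way to confirm the toolchain inside g1 is to call nvcc through a job (the path assumes the CUDA 12.6 layout in the image):

docker exec -it slurmctld bash -lc 'srun -p gpu --gres=gpu:1 /usr/local/cuda-12.6/bin/nvcc --version'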

Implementation Summary

Checklist

  • Added CUDA image (Dockerfile.gpu)
  • Added GPU node (g1) to docker-compose
  • Updated Slurm configs for GRES and gpu partition
  • Optionalized config copy for cgroup.conf / gres.conf
  • Added live update support in update_slurmfiles.sh
  • Verified isolation via collect_docker_slurm_info.sh
  • Confirmed single-GPU scheduling via Slurm

Follow-ups

  • Update test scripts to detect partition dynamically (cpu/gpu)
  • Extend README with “Using GPUs” section
  • Add multi-GPU examples (g1+g2 or single node with 2 GPUs)
  • Automate nvidia-smi validation in test flow
  • Add multi-node GPU deployment support
    • Implement Docker Swarm mode to distribute GPU nodes (g1, g2, etc.) across multiple physical hosts:
      • Initialize Swarm (docker swarm init) on the manager
      • Join additional GPU machines using docker swarm join
      • Label each node with docker node update --label-add gpu=<id>
      • Use an overlay network for inter-host communication
      • Extend slurm.conf and gres.conf to include g2 and additional GPU nodes
    • Optionally, explore overlay networks without Swarm using tools like Weave Net or Flannel
    • For persistent storage, support NFS-mounted volumes or a distributed volume driver
    • Validate with:
      docker service ps slurm-cluster_g1
      docker service ps slurm-cluster_g2
      docker exec slurmctld sinfo
    • Goal: seamless Slurm cluster scaling across multiple GPU-equipped hosts
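
A rough sketch of the Swarm bootstrap described above (host names, labels, and addresses are placeholders):

# On the manager host
docker swarm init
docker network create --driver overlay --attachable slurm-network

# On each additional GPU host (token and address come from `docker swarm init` output)
docker swarm join --token <worker-token> <manager-ip>:2377

# Back on the manager: label GPU nodes so services can be pinned to them
docker node update --label-add gpu=0 gpu-host-1
docker node update --label-add gpu=1 gpu-host-2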

@giovtorres (Owner) left a comment

Thank you so much for this feature! I took a first pass and it looks good. I do have some questions and some requested changes.


PartitionName=normal Nodes=c1,c2 Default=YES MaxTime=INFINITE State=UP
PartitionName=cpu Nodes=c1,c2 Default=YES MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=g1 Default=No MaxTime=INFINITE State=UP

Suggested change
PartitionName=gpu Nodes=g1 Default=No MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=g1 Default=NO MaxTime=INFINITE State=UP


PartitionName=normal Nodes=c1,c2 Default=YES MaxTime=INFINITE State=UP
PartitionName=cpu Nodes=c1,c2 Default=YES MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=g1 Default=No MaxTime=INFINITE State=UP

Suggested change
PartitionName=gpu Nodes=g1 Default=No MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=g1 Default=NO MaxTime=INFINITE State=UP

# #SBATCH -w g1

# Make CUDA visible to Slurm-launched shells (they don't inherit container ENV)
export CUDA_HOME=/usr/local/cuda-12.6

I think we should allow flexibility here:

Suggested change
export CUDA_HOME=/usr/local/cuda-12.6
export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-12.6}"

Something like this. Thoughts?

dockerfile: Dockerfile.gpu
args:
SLURM_VERSION: ${SLURM_VERSION:-25.05.3}
# runtime: nvidia

Can this comment be removed?

Comment on lines +163 to +166
# privileged: true
# gpus: "all"
# gpus:
# - device: "0" # Expose only GPU 0

Are these comments needed?

Comment on lines +1 to +123
# Multi-stage Dockerfile for Slurm runtime with NVIDIA GPU support
# Stage 1: Build RPMs using the builder image
# Stage 2: Install RPMs in a clean runtime image with NVIDIA support

ARG SLURM_VERSION

# ============================================================================
# Stage 1: Build RPMs
# ============================================================================
FROM rockylinux/rockylinux:9 AS builder

ARG SLURM_VERSION

# Enable CRB and EPEL repositories for development packages
RUN set -ex \
&& dnf makecache \
&& dnf -y update \
&& dnf -y install dnf-plugins-core epel-release \
&& dnf config-manager --set-enabled crb \
&& dnf makecache

# Install RPM build tools and dependencies
RUN set -ex \
&& dnf -y install \
autoconf \
automake \
bzip2 \
freeipmi-devel \
dbus-devel \
gcc \
gcc-c++ \
git \
gtk2-devel \
hdf5-devel \
http-parser-devel \
hwloc-devel \
json-c-devel \
libcurl-devel \
libyaml-devel \
lua-devel \
lz4-devel \
make \
man2html \
mariadb-devel \
munge \
munge-devel \
ncurses-devel \
numactl-devel \
openssl-devel \
pam-devel \
perl \
python3 \
python3-devel \
readline-devel \
rpm-build \
rpmdevtools \
rrdtool-devel \
wget \
&& dnf clean all \
&& rm -rf /var/cache/dnf

# Setup RPM build environment
RUN rpmdev-setuptree

# Copy RPM macros
COPY rpmbuild/slurm.rpmmacros /root/.rpmmacros

# Download official Slurm release tarball and build RPMs with slurmrestd enabled
RUN set -ex \
&& wget -O /root/rpmbuild/SOURCES/slurm-${SLURM_VERSION}.tar.bz2 \
https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2 \
&& cd /root/rpmbuild/SOURCES \
&& rpmbuild -ta slurm-${SLURM_VERSION}.tar.bz2 \
&& ls -lh /root/rpmbuild/RPMS/x86_64/

# ============================================================================
# Stage 2: Runtime image with NVIDIA GPU support
# ============================================================================
FROM rockylinux/rockylinux:9

LABEL org.opencontainers.image.source="https://github.com/giovtorres/slurm-docker-cluster" \
org.opencontainers.image.title="slurm-docker-cluster-gpu" \
org.opencontainers.image.description="Slurm Docker cluster on Rocky Linux 9 with NVIDIA GPU support" \
maintainer="Giovanni Torres"

ARG SLURM_VERSION

# Enable CRB and EPEL repositories for runtime dependencies
RUN set -ex \
&& dnf makecache \
&& dnf -y update \
&& dnf -y install dnf-plugins-core epel-release \
&& dnf config-manager --set-enabled crb \
&& dnf makecache

# Install runtime dependencies only
RUN set -ex \
&& dnf -y install \
bash-completion \
bzip2 \
gettext \
hdf5 \
http-parser \
hwloc \
json-c \
jq \
libaec \
libyaml \
lua \
lz4 \
mariadb \
munge \
numactl \
perl \
procps-ng \
psmisc \
python3 \
readline \
vim-enhanced \
wget \
&& dnf clean all \
&& rm -rf /var/cache/dnf


Could we avoid some duplication by extending the base image? For example,

FROM slurm-docker-cluster:${SLURM_VERSION}
RUN dnf config-manager --add-repo ...
[... rest of GPU only changes ...]

RUN set -ex \
&& dnf -y install \
nvidia-container-toolkit \
cuda-toolkit-12-6 \

This should be a Docker ARG, e.g. ARG CUDA_VERSION=12.6 and then

RUN dnf install -y cuda-toolkit-${CUDA_VERSION//./-}
ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}

&& dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
| tee /etc/yum.repos.d/nvidia-container-toolkit.repo \
&& sed -i -e 's/^gpgcheck=1/gpgcheck=0/' -e 's/^repo_gpgcheck=1/repo_gpgcheck=0/' /etc/yum.repos.d/nvidia-container-toolkit.repo \

Why disable the gpgcheck? There should be an RPM GPG key that can be imported, no?
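
For example, the container-toolkit repo references a key that could be imported instead of disabling the check (an untested sketch; verify the URL against the repo file):

# Import the repo key referenced by nvidia-container-toolkit.repo
rpm --import https://nvidia.github.io/libnvidia-container/gpgkey
# The CUDA rhel9 repo defines its own gpgkey entry, so dnf can import it
# automatically on first install when gpgcheck stays enabled.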

NodeName=g1 CPUs=4 RealMemory=1000 Gres=gpu:1 State=UNKNOWN

PartitionName=normal Nodes=c1,c2 Default=YES MaxTime=INFINITE State=UP
PartitionName=cpu Nodes=c1,c2 Default=YES MaxTime=INFINITE State=UP

Changing the partition name is a breaking change. Was this intentional?

@andrewssobral (Author)

Thank you @giovtorres for your suggestions. I will work on it and adapt this PR.
