
Adds support for configuring MIG #656

Open - wants to merge 3 commits into base: main
3 changes: 3 additions & 0 deletions ansible/.gitignore
@@ -90,3 +90,6 @@ roles/*
!roles/gateway/**
!roles/alertmanager/
!roles/alertmanager/**
!roles/slurm_recompile/
!roles/slurm_recompile/**

14 changes: 14 additions & 0 deletions ansible/extras.yml
@@ -48,6 +48,20 @@
name: cuda
tasks_from: "{{ 'runtime.yml' if appliances_mode == 'configure' else 'install.yml' }}"

- name: Setup vGPU
Collaborator: Looking at the role docs, do we need idracadm7 changes to support SR-IOV and/or the iommu role?

Collaborator Author: They are BIOS settings. I'm actually unsure if we need those when not using vGPU.

Collaborator: This runs before slurm when run from site.yml, is that OK?

Collaborator Author: Yep, at this point we are just creating the MIG devices.

  hosts: vgpu
  become: yes
  gather_facts: yes
  tags: vgpu
  tasks:
    - include_role:
        name: stackhpc.linux.vgpu
        tasks_from: "{{ 'configure.yml' if appliances_mode == 'configure' else 'install.yml' }}"
  handlers:
    - name: reboot
      fail:
        msg: Reboot handler for stackhpc.linux.vgpu role fired unexpectedly. This was supposed to be unreachable.

- name: Persist hostkeys across rebuilds
# Must be after filesystems.yml (for storage)
# and before portal.yml (where OOD login node hostkeys are scanned)
10 changes: 10 additions & 0 deletions ansible/fatimage.yml
@@ -250,6 +250,16 @@
name: cloudalchemy.grafana
tasks_from: install.yml

- name: Add support for NVIDIA GPU auto detection to Slurm
Collaborator: I don't like having these tasks outside a role - we've always regretted that. It can't be run with cuda:install.yml from extras.yml because that's before slurm, but maybe we could add it as a mig.yml taskfile which is called from here?

Collaborator: Also - we should be really clear about idempotency/when it's safe to run this. If it's in the cuda role it's obvious where to state that!

Collaborator Author: Sure, sounds reasonable. I did wonder if we'd want to recompile slurm for other reasons, so it could live in a slurm-recompile role?

Collaborator: Possibly - for this specifically either way there's a cuda/slurm dependency, so I'd go with sticking it in cuda for the moment, probably.

Collaborator Author: I stuck it in slurm_recompile, but will move if you prefer.

  hosts: cuda
  become: yes
  tasks:
    - name: Recompile slurm
      import_role:
        name: slurm_recompile
      vars:
        slurm_recompile_nvml: "{{ groups.cuda | length > 0 }}"

- name: Run post.yml hook
vars:
appliances_environment_root: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}"
1 change: 1 addition & 0 deletions ansible/roles/compute_init/README.md
@@ -75,6 +75,7 @@ it also requires an image build with the role name added to the
| extras.yml | basic_users | All functionality [6] | No |
| extras.yml | eessi | All functionality [7] | No |
| extras.yml | cuda | None required - use image build | Yes [8] |
| extras.yml | vgpu | All functionality | Yes |
| extras.yml | persist_hostkeys | Not relevant for compute nodes | n/a |
| extras.yml | compute_init (export) | Not relevant for compute nodes | n/a |
| extras.yml | k9s (install) | Not relevant during boot | n/a |
8 changes: 8 additions & 0 deletions ansible/roles/compute_init/files/compute-init.yml
@@ -19,6 +19,8 @@
enable_basic_users: "{{ os_metadata.meta.basic_users | default(false) | bool }}"
enable_eessi: "{{ os_metadata.meta.eessi | default(false) | bool }}"
enable_chrony: "{{ os_metadata.meta.chrony | default(false) | bool }}"
enable_vgpu: "{{ os_metadata.meta.vgpu | default(false) | bool }}"


# TODO: "= role defaults" - could be moved to a vars_file: on play with similar precedence effects
resolv_conf_nameservers: []
@@ -295,6 +297,12 @@
cmd: "cvmfs_config setup"
when: enable_eessi

- name: Configure VGPUs
  include_role:
    name: stackhpc.linux.vgpu
    tasks_from: 'configure.yml'
  when: enable_vgpu

# NB: don't need conditional block on enable_compute as have already exited
# if not the case
- name: Write Munge key
5 changes: 5 additions & 0 deletions ansible/roles/cuda/tasks/facts.yml
@@ -0,0 +1,5 @@
---

- name: Set cuda_facts_version_short
  set_fact:
    cuda_facts_version_short: "{{ cuda_version_short }}"
3 changes: 3 additions & 0 deletions ansible/roles/slurm_recompile/defaults/main.yml
@@ -0,0 +1,3 @@
---
slurm_recompile_nvml: false

41 changes: 41 additions & 0 deletions ansible/roles/slurm_recompile/tasks/main.yml
@@ -0,0 +1,41 @@
---
- name: Get facts about CUDA installation
  import_role:
    name: cuda
    tasks_from: facts.yml

- name: Gather the package facts
  ansible.builtin.package_facts:
    manager: auto

- name: Set fact containing slurm package facts
  set_fact:
    slurm_package: "{{ ansible_facts.packages['slurm-slurmd-ohpc'].0 }}"

- name: Recompile and install slurm packages
  shell: |
    #!/bin/bash
    source /etc/profile
    set -eux
    dnf download -y --source slurm-slurmd-ohpc-{{ slurm_package.version }}-{{ slurm_package.release }}
    rpm -i slurm-ohpc-*.src.rpm
    cd /root/rpmbuild/SPECS
    dnf builddep -y slurm.spec
    rpmbuild -bb{% if slurm_recompile_nvml | bool %} -D "_with_nvml --with-nvml=/usr/local/cuda-{{ cuda_facts_version_short }}/targets/x86_64-linux/"{% endif %} slurm.spec
    dnf reinstall -y /root/rpmbuild/RPMS/x86_64/*.rpm
  become: true

- name: Workaround missing symlink
  # Workaround path issue: https://groups.google.com/g/slurm-users/c/cvGb4JnK8BY
  command: ln -s /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so
  args:
    creates: /lib64/libnvidia-ml.so
  when: slurm_recompile_nvml | bool

- name: Cleanup Dependencies
  shell: |
    #!/bin/bash
    set -eux
    set -o pipefail
    dnf history list | grep Install | grep 'builddep -y slurm.spec' | head -n 1 | awk '{print $1}' | xargs dnf history -y undo
  become: true
10 changes: 10 additions & 0 deletions ansible/validate.yml
@@ -83,3 +83,13 @@
- import_role:
name: lustre
tasks_from: validate.yml

- name: Validate vGPU configuration
  hosts: vgpu
  become: yes
  gather_facts: yes
  tags: vgpu
  tasks:
    - include_role:
        name: stackhpc.linux.vgpu
        tasks_from: validate.yml
209 changes: 209 additions & 0 deletions docs/mig.md
@@ -0,0 +1,209 @@
# vGPU/MIG configuration

This page details how to configure Multi-Instance GPU (MIG) in Slurm.

## Pre-requisites

- An image built with ``cuda`` support; this should automatically recompile Slurm against NVML. A sketch of enabling this for the image build is shown below.
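
A minimal sketch of enabling this for an image build, assuming the appliance's usual pattern of adding ``builder`` as a child of a role's group in the environment used for the build (the file path and group wiring here are illustrative, not taken from this PR):

```
# environments/<environment>/inventory/groups (illustrative)
[cuda:children]
builder
```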

## Inventory

Add the relevant hosts to the ``vgpu`` group, for example in ``environments/$ENV/inventory/groups``:

```
[vgpu:children]
cuda
```

## Configuration

Use variables from the [stackhpc.linux.vgpu](https://github.com/stackhpc/ansible-collection-linux/tree/main/roles/vgpu) role.

For example, in `environments/<environment>/inventory/group_vars/all/vgpu`:

```
---
vgpu_definitions:
- pci_address: "0000:17:00.0"
mig_devices:
"1g.10gb": 4
"4g.40gb": 1
- pci_address: "0000:81:00.0"
mig_devices:
"1g.10gb": 4
"4g.40gb": 1
```

The appliance will use the driver installed via the ``cuda`` role.

Use ``lspci`` to determine the PCI addresses, e.g.:

```
[root@io-io-gpu-02 ~]# lspci -nn | grep -i nvidia
06:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
0c:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
46:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
4c:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
```

The supported profiles can be discovered by consulting the [NVIDIA documentation](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-mig-profiles)
or interactively by running the following on one of the compute nodes with GPU resources:

```
[rocky@io-io-gpu-05 ~]$ sudo nvidia-smi -i 0 -mig 1
Enabled MIG Mode for GPU 00000000:06:00.0
All done.
[rocky@io-io-gpu-05 ~]$ sudo nvidia-smi mig -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles: |
| GPU Name ID Instances Memory P2P SM DEC ENC |
| Free/Total GiB CE JPEG OFA |
|=============================================================================|
| 0 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
| 1 1 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
| 1 1 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
| 2 2 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
| 3 3 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
| 4 4 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
| 8 7 1 |
+-----------------------------------------------------------------------------+
| 1 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
| 1 1 0 |
+-----------------------------------------------------------------------------+
| 1 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 1 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
| 1 1 0 |
+-----------------------------------------------------------------------------+
| 1 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
| 2 2 0 |
+-----------------------------------------------------------------------------+
| 1 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
| 3 3 0 |
+-----------------------------------------------------------------------------+
| 1 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
| 4 4 0 |
+-----------------------------------------------------------------------------+
| 1 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
| 8 7 1 |
+-----------------------------------------------------------------------------+
| 2 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
| 1 1 0 |
+-----------------------------------------------------------------------------+
| 2 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 2 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
| 1 1 0 |
+-----------------------------------------------------------------------------+
| 2 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
| 2 2 0 |
+-----------------------------------------------------------------------------+
| 2 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
| 3 3 0 |
+-----------------------------------------------------------------------------+
| 2 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
| 4 4 0 |
+-----------------------------------------------------------------------------+
| 2 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
| 8 7 1 |
+-----------------------------------------------------------------------------+
| 3 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
| 1 1 0 |
+-----------------------------------------------------------------------------+
| 3 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 3 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
| 1 1 0 |
+-----------------------------------------------------------------------------+
| 3 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
| 2 2 0 |
+-----------------------------------------------------------------------------+
| 3 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
| 3 3 0 |
+-----------------------------------------------------------------------------+
| 3 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
| 4 4 0 |
+-----------------------------------------------------------------------------+
| 3 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
| 8 7 1 |
+-----------------------------------------------------------------------------+
```
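
Once the role has applied ``vgpu_definitions``, one way to check the MIG devices were created is the standard ``nvidia-smi -L`` listing; this is generic NVIDIA tooling rather than anything specific to this role, and the output below is illustrative only, with UUIDs elided:

```
[rocky@io-io-gpu-05 ~]$ nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
  MIG 4g.40gb     Device  0: (UUID: MIG-...)
  MIG 1g.10gb     Device  1: (UUID: MIG-...)
  MIG 1g.10gb     Device  2: (UUID: MIG-...)
  MIG 1g.10gb     Device  3: (UUID: MIG-...)
  MIG 1g.10gb     Device  4: (UUID: MIG-...)
```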

## compute_init

Use the ``vgpu`` metadata option to enable creation of MIG devices on rebuild, as sketched below.
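
A minimal sketch of setting this for a compute nodegroup, assuming the nodegroup-level ``compute_init_enable`` OpenTofu option is how this metadata gets set; the nodegroup name, node list and other values here are illustrative:

```
# environments/<environment>/tofu/main.tf (illustrative)
module "cluster" {
  source = "../../site/tofu/"
  ...
  compute = {
    gpu = {
      nodes               = ["io-io-gpu-01", "io-io-gpu-02"]
      compute_init_enable = ["compute", "vgpu"]   # sets the vgpu metadata flag on boot/rebuild
    }
  }
}
```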

## GRES configuration

You should stop terraform templating out `partitions.yml` and instead specify `openhpc_nodegroups` manually. To do this,
set the `autogenerated_partitions_enabled` terraform variable to `false`. For example (`environments/production/tofu/main.tf`):
Collaborator Author: Requires: #665


```
module "cluster" {
source = "../../site/tofu/"
...
# We manually populate this to add GRES. See environments/site/inventory/group_vars/all/partitions-manual.yml.
autogenerated_partitions_enabled = false
}
```

GPU types can be determined by deploying slurm without any gres configuration and then running
`sudo slurmd -G` on a compute node where GPU resources exist. An example is shown below:

```
[rocky@io-io-gpu-02 ~]$ sudo slurmd -G
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI
,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI
,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=291 ID=7696487 File=/dev/nvidia-caps/nvidia-cap291 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=417 ID=7696487 File=/dev/nvidia-caps/nvidia-cap417 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=336 ID=7696487 File=/dev/nvidia-caps/nvidia-cap336 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=345 ID=7696487 File=/dev/nvidia-caps/nvidia-cap345 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=354 ID=7696487 File=/dev/nvidia-caps/nvidia-cap354 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=507 ID=7696487 File=/dev/nvidia-caps/nvidia-cap507 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=516 ID=7696487 File=/dev/nvidia-caps/nvidia-cap516 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=525 ID=7696487 File=/dev/nvidia-caps/nvidia-cap525 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
```

GRES resources can then be configured manually. An example is shown below
(`environments/<environment>/inventory/group_vars/all/partitions-manual.yml`):

```
openhpc_partitions:
  - name: cpu
  - name: gpu

openhpc_nodegroups:
  - name: cpu
  - name: gpu
    gres_autodetect: nvml
    gres:
      - conf: "gpu:nvidia_h100_80gb_hbm3:2"
      - conf: "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2"
      - conf: "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6"
```
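
For reference, with ``gres_autodetect: nvml`` the stackhpc.openhpc role should enable NVML autodetection in ``gres.conf`` and emit the ``gres`` counts on the node definitions in ``slurm.conf``. The snippet below is only an illustration of the kind of Slurm configuration that results, not verbatim output from the role:

```
# gres.conf (illustrative)
AutoDetect=nvml

# slurm.conf node definition (illustrative)
NodeName=io-io-gpu-[01-02] ... Gres=gpu:nvidia_h100_80gb_hbm3:2,gpu:nvidia_h100_80gb_hbm3_4g.40gb:2,gpu:nvidia_h100_80gb_hbm3_1g.10gb:6
```
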
Collaborator: Add hint on how to work out what the autodetection-created gres name is?

Collaborator: Maybe `slurmd -C` or `slurmd -G`?

Collaborator Author: Tried:

```
[rocky@io-io-gpu-02 ~]$ slurmd -C
NodeName=io-io-gpu-02 CPUs=96 Boards=1 SocketsPerBoard=1 CoresPerSocket=96 ThreadsPerCore=1 RealMemory=772878
UpTime=14-20:45:08
[rocky@io-io-gpu-02 ~]$ slurmd -G
[rocky@io-io-gpu-02 ~]$

```

Collaborator Author: `journalctl -u slurmd` does print this information:

```
May 08 14:59:26 io-io-gpu-02.io.internal slurmd[174622]: slurmd-io-io-gpu-02: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
May 08 14:59:26 io-io-gpu-02.io.internal slurmd[174622]: slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-
95 CoreCnt=96 Links=6,6,-1,6 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
May 08 14:59:26 io-io-gpu-02.io.internal slurmd[174622]: slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-
95 CoreCnt=96 Links=6,6,6,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
May 08 14:59:26 io-io-gpu-02.io.internal slurmd[174622]: slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=292 ID=7696487 File=/dev/nvidia
2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 Cores=0-95 CoreCnt=96 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML
May 08 14:59:26 io-io-gpu-02.io.internal slurmd[174622]: slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=337 ID=7696487 File=/dev/nvidia
2,/dev/nvidia-caps/nvidia-cap336,/dev/nvidia-caps/nvidia-cap337 Cores=0-95 CoreCnt=96 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML
May 08 14:59:26 io-io-gpu-02.io.internal slurmd-io-io-gpu-02[174622]: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
May 08 14:59:26 io-io-gpu-02.io.internal slurmd[174622]: slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=346 ID=7696487 File=/dev/nvidia
2,/dev/nvidia-caps/nvidia-cap345,/dev/nvidia-caps/nvidia-cap346 Cores=0-95 CoreCnt=96 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML
May 08 14:59:26 io-io-gpu-02.io.internal slurmd[174622]: slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=355 ID=7696487 File=/dev/nvidia
2,/dev/nvidia-caps/nvidia-cap354,/dev/nvidia-caps/nvidia-cap355 Cores=0-95 CoreCnt=96 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML
May 08 14:59:26 io-io-gpu-02.io.internal slurmd[174622]: slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=418 ID=7696487 File=/dev/nvidia
3,/dev/nvidia-caps/nvidia-cap417,/dev/nvidia-caps/nvidia-cap418 Cores=0-95 CoreCnt=96 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML
May 08 14:59:26 io-io-gpu-02.io.internal slurmd[174622]: slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=508 ID=7696487 File=/dev/nvidia
3,/dev/nvidia-caps/nvidia-cap507,/dev/nvidia-caps/nvidia-cap508 Cores=0-95 CoreCnt=96 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML
May 08 14:59:26 io-io-gpu-02.io.internal slurmd[174622]: slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=517 ID=7696487 File=/dev/nvidia
3,/dev/nvidia-caps/nvidia-cap516,/dev/nvidia-caps/nvidia-cap517 Cores=0-95 CoreCnt=96 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML
May 08 14:59:26 io-io-gpu-02.io.internal slurmd[174622]: slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=526 ID=7696487 File=/dev/nvidia
3,/dev/nvidia-caps/nvidia-cap525,/dev/nvidia-caps/nvidia-cap526 Cores=0-95 CoreCnt=96 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML
May 08 14:59:26 io-io-gpu-02.io.internal slurmd-io-io-gpu-02[174622]: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-95 CoreC
nt=96 Links=6,6,-1,6 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
May 08 14:59:26 io-io-gpu-02.io.internal slurmd-io-io-gpu-02[174622]: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-95 CoreC
nt=96 Links=6,6,6,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML

```

Collaborator Author: Turns out I needed sudo when doing `slurmd -G`:

```
[rocky@io-io-gpu-02 ~]$ sudo slurmd -G
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI
,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI
,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=291 ID=7696487 File=/dev/nvidia-caps/nvidia-cap291 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=417 ID=7696487 File=/dev/nvidia-caps/nvidia-cap417 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=336 ID=7696487 File=/dev/nvidia-caps/nvidia-cap336 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=345 ID=7696487 File=/dev/nvidia-caps/nvidia-cap345 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=354 ID=7696487 File=/dev/nvidia-caps/nvidia-cap354 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=507 ID=7696487 File=/dev/nvidia-caps/nvidia-cap507 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=516 ID=7696487 File=/dev/nvidia-caps/nvidia-cap516 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=525 ID=7696487 File=/dev/nvidia-caps/nvidia-cap525 Links=(null) Flags=HAS_FILE,HAS_TYPE,
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT

```

Collaborator Author: Added this to the docs.

4 changes: 4 additions & 0 deletions environments/common/inventory/group_vars/all/vgpu
@@ -0,0 +1,4 @@
---

# Nvidia driver is provided by cuda role.
vgpu_nvidia_driver_install_enabled: false
4 changes: 4 additions & 0 deletions environments/common/inventory/groups
@@ -112,6 +112,10 @@ freeipa_client
[cuda]
# Hosts to install NVIDIA CUDA on - see ansible/roles/cuda/README.md

[vgpu]
# FIXME: Update once PR merged
# Hosts where vGPU/MIG should be configured - see https://github.com/stackhpc/ansible-collection-linux/pull/43/files#diff-74e43d9a34244aa54721f4dbd12a029baa87957afd762b88c2677aa75414f514R75

[eessi]
# Hosts on which EESSI stack should be configured

5 changes: 4 additions & 1 deletion requirements.yml
Collaborator (@sjpb, May 8, 2025): There are a few other things which need fixing given bumping stackhpc.openhpc:

- common openhpc config
- skeleton templating
- caas templating
- stackhpc environment openhpc overrides
- rebuild config
- stackhpc.openhpc:validate.yml should get called from ansible/validate.yml

Collaborator: @jovial see #666 for an attempt to handle some/most of these.

@@ -4,7 +4,7 @@ roles:
version: v25.3.2
name: stackhpc.nfs
- src: https://github.com/stackhpc/ansible-role-openhpc.git
version: v0.28.0
version: feature/gres-autodetect
Collaborator: Needs bumping to a release.

name: stackhpc.openhpc
- src: https://github.com/stackhpc/ansible-node-exporter.git
version: stackhpc
@@ -55,4 +55,7 @@ collections:
version: 0.0.15
- name: stackhpc.pulp
version: 0.5.5
- name: https://github.com/stackhpc/ansible-collection-linux
type: git
version: feature/mig-only
...