Skip to content

Adds support for configuring MIG #656

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Adds support for configuring MIG #656

wants to merge 1 commit into from

Conversation

jovial
Copy link
Collaborator

@jovial jovial commented Apr 23, 2025

No description provided.

Copy link
Collaborator

@sjpb sjpb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly comments re. where stuff is, and some minor typos etc

tasks:
- name: Get facts about CUDA installation
import_role: cuda
tasks_from: facts.yml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No such file?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But also, see #652, cuda_version_short is now just a rolevar

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to push this file. Good spot; it just sets that variable as a fact so that I can access it outside the role :D

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that means it would actually cope with an override. Neat.

- name: Recompile and install slurm packages
shell: |
#!/bin/bash
dnf download --source slurm-slurmd-ohpc
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondefing if really we need to get the installed slurm-slurmd-ohpc version (would be in package facts) and use that to gurantee we are getting the same version.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - I've explicitly pulled the same version

@@ -48,6 +48,20 @@
name: cuda
tasks_from: "{{ 'runtime.yml' if appliances_mode == 'configure' else 'install.yml' }}"

- name: Setup vGPU
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the role docs, do we need idracadm7 changes to support SR-IOV and/or the iommu role?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So they are bios settings. I'm actually unsure if we need those when not using vGPU.

@@ -250,6 +250,27 @@
name: cloudalchemy.grafana
tasks_from: install.yml

- name: Add support for NVIDIA GPU auto detection to Slurm
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like having these tasks outside a role - we've always regretted that. It can't be run with cuda:install.yml from extras.yml b/c that's before slurm, but maybe we could add it as a mig.yml taskfile which is called from here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - we should be really clear about idempotency/when its safe to run this. If its in the cuda role its obvious where to state that!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sounds reasonable. I did wonder if we'd want to recompile slurm for other reasons so could live in a slurm-recompile role?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly - for this specifically either way there's a cuda/slurm dependency so I'd go with sticking it in cuda for the moment, probably.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stuck it in slurm_reccompile, but will move if you prefer

dnf download --source slurm-slurmd-ohpc
rpm -i slurm-ohpc-*.src.rpm
dnf install -y @'Development Tools'
cd /root/rpmbuild/SPECS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we cleanup afterwards (at least this directory, probably uninstalling devtools will remove some deps which actually we want) just to try avoid bloating the image.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point I could possibly revert the transaction afterwards. The build dependencies do seem to pull in a whole load of packages.

@jovial jovial force-pushed the feature/mig branch 2 times, most recently from a92cdab to 994d8f6 Compare April 25, 2025 20:47
@jovial jovial marked this pull request as ready for review April 28, 2025 08:34
@jovial jovial requested a review from a team as a code owner April 28, 2025 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants