Adds support for configuring MIG #656
Mostly comments re. where stuff is, and some minor typos etc
ansible/fatimage.yml
Outdated
tasks:
- name: Get facts about CUDA installation
import_role: cuda
tasks_from: facts.yml
No such file?
But also, see #652: cuda_version_short is now just a rolevar.
Forgot to push this file. Good spot; it just sets that variable as a fact so that I can access it outside the role :D
So that means it would actually cope with an override. Neat.
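For illustration, a minimal sketch of what such a task file could look like. This is not the file from the PR: the path `roles/cuda/tasks/facts.yml` and the task wording are assumptions. Only the variable name `cuda_version_short` and the file name `facts.yml` come from the thread.

```yaml
# Hypothetical roles/cuda/tasks/facts.yml (assumed layout, not the PR's file).
# Re-publishes the role variable as a host fact so later plays can read it.
# The value is templated when the task runs, so a higher-precedence override
# of cuda_version_short is what ends up in the fact.
- name: Expose cuda_version_short outside the role
  ansible.builtin.set_fact:
    cuda_version_short: "{{ cuda_version_short }}"
```

A play would then pull it in with `import_role` using `name: cuda` and `tasks_from: facts.yml`, as in the snippet under review.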
ansible/fatimage.yml
Outdated
- name: Recompile and install slurm packages
shell: |
#!/bin/bash
dnf download --source slurm-slurmd-ohpc
I'm wondering if we really need to get the installed slurm-slurmd-ohpc version (it would be in package facts) and use that to guarantee we are getting the same version.
Done - I've explicitly pulled the same version
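One way the version pinning suggested above could be done is sketched below. It is not necessarily how the PR implements it: the variable name `slurmd_pkg` and the `chdir` path are illustrative assumptions, and it presumes the `dnf download` plugin (dnf-plugins-core) is installed.

```yaml
# Sketch only: pin the source RPM to the installed slurm-slurmd-ohpc build.
- name: Gather installed package facts
  ansible.builtin.package_facts:
    manager: auto

- name: Download the source RPM matching the installed slurmd
  ansible.builtin.command:
    # name-version-release pins the exact build instead of "latest in repo"
    cmd: >-
      dnf download --source
      slurm-slurmd-ohpc-{{ slurmd_pkg.version }}-{{ slurmd_pkg.release }}
    chdir: /root  # assumed working directory
  vars:
    # package_facts returns a list per package name; take the first entry
    slurmd_pkg: "{{ ansible_facts.packages['slurm-slurmd-ohpc'][0] }}"
```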
@@ -48,6 +48,20 @@
name: cuda
tasks_from: "{{ 'runtime.yml' if appliances_mode == 'configure' else 'install.yml' }}"

- name: Setup vGPU
Looking at the role docs, do we need idracadm7 changes to support SR-IOV and/or the iommu role?
So they are BIOS settings. I'm actually unsure if we need those when not using vGPU.
@@ -250,6 +250,27 @@
name: cloudalchemy.grafana
tasks_from: install.yml

- name: Add support for NVIDIA GPU auto detection to Slurm
I don't like having these tasks outside a role - we've always regretted that. It can't be run with cuda:install.yml from extras.yml b/c that's before slurm, but maybe we could add it as a mig.yml taskfile which is called from here?
Also - we should be really clear about idempotency/when it's safe to run this. If it's in the cuda role it's obvious where to state that!
Sure, sounds reasonable. I did wonder if we'd want to recompile slurm for other reasons, so it could live in a slurm-recompile role?
Possibly - for this specifically either way there's a cuda/slurm dependency so I'd go with sticking it in cuda for the moment, probably.
I stuck it in slurm_reccompile, but will move if you prefer
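The "taskfile called from the play" pattern discussed in this thread might look roughly like the sketch below. It shows the reviewer's cuda/`mig.yml` variant. Whether a `mig.yml` exists in the role is an assumption, and the same shape would apply if the tasks live in a separate recompile role instead.

```yaml
# Sketch only: replace the inline tasks in the play with a role task file.
# "mig.yml" is the file name proposed in review; it may not exist as such.
- name: Add support for NVIDIA GPU auto detection to Slurm
  ansible.builtin.include_role:
    name: cuda
    tasks_from: mig.yml
```

Keeping the tasks in a role also gives one obvious place (the role README) to document idempotency and when it is safe to run them.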
ansible/fatimage.yml
Outdated
dnf download --source slurm-slurmd-ohpc
rpm -i slurm-ohpc-*.src.rpm
dnf install -y @'Development Tools'
cd /root/rpmbuild/SPECS
Can we clean up afterwards (at least this directory; uninstalling devtools will probably remove some deps which we actually want), just to try to avoid bloating the image?
Fair point. I could possibly revert the transaction afterwards. The build dependencies do seem to pull in a whole load of packages.
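A hedged sketch of the cleanup being discussed, assuming the build tooling was installed in the most recent dnf transaction and that the build tree is `/root/rpmbuild` (inferred from the `cd /root/rpmbuild/SPECS` line above). If build dependencies were installed across several transactions, each would need undoing, and the undo may remove packages that are wanted at runtime, as noted in the review.

```yaml
# Sketch only: undo the build-tooling install and drop the build tree.
- name: Undo the most recent dnf transaction (assumed to be the build tooling)
  ansible.builtin.command:
    cmd: dnf history undo -y last
  changed_when: true  # always alters installed packages when it succeeds

- name: Remove the rpmbuild tree to avoid bloating the image
  ansible.builtin.file:
    path: /root/rpmbuild  # assumed location
    state: absent
```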
a92cdab to 994d8f6