Introduce get_has_h100() function. #107
Conversation
Information is cached in the RSNT_HAS_H100 environment variable
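For context, the caching pattern the description refers to typically looks something like the sketch below; compute_has_h100() is a hypothetical stand-in for whatever detection logic this PR settles on, and this is not the PR's actual implementation.

-- Hypothetical illustration of env-var caching: trust RSNT_HAS_H100 when a
-- sysadmin (or an earlier call) has already set it, otherwise fall back to
-- detection. compute_has_h100() is a stand-in, not part of this PR.
local function get_has_h100()
    local cached = os.getenv("RSNT_HAS_H100")
    if cached ~= nil and cached ~= "" then
        return cached == "yes"
    end
    return compute_has_h100()
end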
local f = io.open(modulercfile, "r")
if f ~= nil then
    for line in f:lines() do
        if line == "module-version cuda/12.6 default" then
This will break in future versions of CUDA (when they become default).
How about something like cat /etc/slurm/slurm.conf | grep gpu:h100 ?
I'm not sure I understand. If we bump the CUDA default on H100 systems, we should adapt this line too?
Killarney's slurm.conf is in $SLURM_CONF.
(But why would we bump the default again before StdEnv/2026?)
I would very much like to avoid creating those modulerc_<cluster> files and instead rely on system information, i.e., use the modulerc_<cluster> only as a last resort.
Your point about the scheduler being down is fair though. I guess we can go back to my idea of reading the slurm.conf (though more smartly than just grepping without any consideration for comments).
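For illustration, a comment-aware scan of slurm.conf could look roughly like the sketch below; the function name, the fallback path, and the Gres pattern are assumptions, not anything already in this PR.

-- Hypothetical sketch: scan slurm.conf for GPU GRES entries, ignoring comments.
-- Assumes SLURM_CONF points at the file, falling back to /etc/slurm/slurm.conf.
local function detect_gpu_types_from_slurm_conf()
    local path = os.getenv("SLURM_CONF") or "/etc/slurm/slurm.conf"
    local f = io.open(path, "r")
    if f == nil then return nil end
    local types = {}
    for line in f:lines() do
        line = line:gsub("#.*$", "")              -- drop trailing comments
        -- match Gres=gpu:<type>[:count] entries, e.g. Gres=gpu:h100:4
        for gpu_type in line:gmatch("[Gg]res=gpu:([%w_]+)") do
            if not gpu_type:match("^%d+$") then   -- skip Gres=gpu:4 (count only)
                types[gpu_type:lower()] = true
            end
        end
    end
    f:close()
    local result = {}
    for t in pairs(types) do table.insert(result, t) end
    return result
end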
I think parsing slurm.conf can still be too error-prone.
Let's go back to basics. What we have here is a bunch of clusters going online around 2025, which need a newer module default and some hiding; it's a bit broader than just the H100 thing. We had the same thing happen with Narval in 2021.
So instead of focusing on H100, we could instead focus on "2025" and introduce an environment variable RSNT_CLUSTER_YEAR, which we can set in CCconfig.lua. Then have a configuration file with a simple table like this:
fir 2025
rorqual 2025
tamia 2025
...
narval 2021
beluga 2018
niagara 2018
graham 2016
cedar 2016
Then StdEnv/2023 can add modulerc_2025, StdEnv/2020 can add modulerc_2021, etc. The 2020/2021 stuff looks a bit superfluous but is here mostly to illustrate that this could be a little more future-proof, if it's also past-proof.
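A minimal sketch of how such a table could be consumed, assuming a plain "<cluster> <year>" file and a hypothetical helper name (neither is defined in this PR):

-- Hypothetical: read a cluster-year table like the one above, one
-- "<cluster> <year>" pair per line, and return the year for this cluster.
local function get_cluster_year(tablefile, cluster)
    local f = io.open(tablefile, "r")
    if f == nil then return nil end
    local year = nil
    for line in f:lines() do
        local name, y = line:match("^%s*(%S+)%s+(%d+)%s*$")
        if name == cluster then
            year = y
            break
        end
    end
    f:close()
    return year
end
-- e.g. CCconfig.lua could export get_cluster_year(yearfile, clustername) as
-- RSNT_CLUSTER_YEAR, and StdEnv/2023 would then pick up modulerc_2025.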
I don't like hard-coding names of clusters. I try to think more generally than just the Alliance. There are systems out there that are not managed by us, and it would be nice to support them out of the box. I would rather rely on features that we can detect, if at all possible.
The problem is that detecting GPUs from a login node remains tricky, even more so for systems we don't even have access to, which may not even run Slurm; in that case we should just tell them to set an appropriate environment variable in /etc/environment or similar. In the end we map from cluster names ourselves as a convenience, to save ourselves the trouble of asking sysadmins at every site to set those...
Within the list at https://docs.alliancecan.ca/wiki/Accessing_CVMFS#Environment_variables I agree it makes sense to complete it with a list of GPU types, but then RSNT_HAS_H100 is way too specific; I'd rather use RSNT_GPU_TYPES, which could be e.g. "H100" or "A100" or "H100,MI300A". But unless we go the sysadmin route, we will need a mapping from cluster name to RSNT_GPU_TYPES in CVMFS (e.g. CCconfig, which is already meant to be CC-specific); otherwise we'd have way too many heuristics and overhead here for little benefit.
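A minimal sketch of such a mapping, assuming the cluster name comes from CC_CLUSTER; the entries shown are examples based on this thread, not a definitive list:

local function gpu_types_for_cluster(cluster)
    -- Example entries only; the real table would be maintained in CCconfig.
    local gpu_types_by_cluster = {
        fir     = "H100",
        rorqual = "H100",
        tamia   = "H100",
        narval  = "A100",
    }
    return gpu_types_by_cluster[cluster]
end
-- e.g. gpu_types_for_cluster(os.getenv("CC_CLUSTER"))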
RSNT_GPU_TYPES makes sense.
I still think we can at least try to detect through slurm.conf. It won't be perfect, and it won't cover non-Slurm systems, but it would be better than nothing. Our interconnect detection and CPU vendor detection aren't perfect either (those do vary on different nodes), but we still do them. Defining RSNT_GPU_TYPES can be an override to avoid the detection.
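A hedged sketch of that precedence, reusing the hypothetical slurm.conf helper sketched earlier in the thread; none of these names exist in the PR as-is:

-- Explicitly defined RSNT_GPU_TYPES wins; otherwise fall back to detection.
-- detect_gpu_types_from_slurm_conf() is the hypothetical helper sketched above.
local function get_gpu_types()
    local override = os.getenv("RSNT_GPU_TYPES")
    if override ~= nil and override ~= "" then
        return override                        -- sysadmin-provided override
    end
    local detected = detect_gpu_types_from_slurm_conf()
    if detected ~= nil and #detected > 0 then
        return table.concat(detected, ",")     -- e.g. "h100" or "h100,mi300a"
    end
    return nil                                 -- unknown: keep current behaviour
end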