Introduce get_has_h100() function. #107
Open
bartoldeman wants to merge 1 commit into main from get-h100
This will break in future versions of CUDA (when they become the default).
How about something like `cat /etc/slurm/slurm.conf | grep gpu:h100`?
I'm not sure I understand. If we bump the CUDA default on H100 systems, we should adapt this line too?
Killarney's `slurm.conf` is in `$SLURM_CONF`.
(But why would we bump the default again before StdEnv/2026?)
I would very much like to avoid creating those `modulerc_<cluster>` files and instead rely on system information, i.e., use the `modulerc_<cluster>` files only as a last resort.

Your point about the scheduler being down is fair, though. I guess we can go back to my idea of reading `slurm.conf` (though more smartly than just grepping, without consideration for comments).
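A minimal Lua sketch of such a comment-aware check (the function name, fallback path, and matching logic are assumptions for illustration, not code from this PR):

```lua
-- Hypothetical sketch, not this PR's code: look for an H100 GPU Gres
-- entry in slurm.conf while ignoring '#' comments.
local function get_has_h100_from_slurm_conf()
    -- Honour $SLURM_CONF (e.g. on Killarney); fall back to the usual path.
    local path = os.getenv("SLURM_CONF") or "/etc/slurm/slurm.conf"
    local f = io.open(path, "r")
    if not f then return false end
    for line in f:lines() do
        local code = line:gsub("#.*$", "")  -- strip comments
        if code:find("gpu:h100", 1, true) then
            f:close()
            return true
        end
    end
    f:close()
    return false
end
```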
I think parsing `slurm.conf` can still be too error-prone.

Let's go back to basics. What we have here is a bunch of clusters going online around 2025 that need a newer module default and some hiding; it's a bit broader than just the H100 thing. We had the same thing happen with Narval in 2021.

So instead of focusing on H100, we could focus on "2025" and introduce an environment variable `RSNT_CLUSTER_YEAR`, which we can set in `CCconfig.lua`. Then have a configuration file with a simple table like this:
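(The table itself did not survive the page capture; purely as an illustration, with example entries rather than the author's actual proposal, it might look like the following.)

```lua
-- Illustrative only: map cluster name to the year its software stack
-- came online. Entries are examples, not the author's actual table.
local cluster_year = {
    narval    = 2021,
    killarney = 2025,
}

-- Sketch of use in CCconfig.lua; CC_CLUSTER and Lmod's setenv() are
-- assumptions about the surrounding modulefile environment.
local year = cluster_year[os.getenv("CC_CLUSTER") or ""]
if year then
    setenv("RSNT_CLUSTER_YEAR", tostring(year))
end
```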
Then StdEnv/2023 can add `modulerc_2025`, StdEnv/2020 can add `modulerc_2021`, etc. The 2020/2021 stuff looks a bit superfluous, but it is here mostly to illustrate that this could be a little more future-proof, if it's also past-proof.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like hard-coding names of clusters. I try to think more generally than just the Alliance. There are systems out there that are not managed by us, and it would be nice to support them out of the box. I would rather we rely on features that we can detect, if at all possible.
The problem is that detecting GPUs from a login node remains tricky, even more so for systems we don't even have access to, which may not even run Slurm; for those, we should just tell admins to set an appropriate environment variable in `/etc/environment` or similar. In the end, we map from cluster names ourselves as a convenience, to save ourselves the trouble of asking sysadmins at every site to set those...

Within the list at https://docs.alliancecan.ca/wiki/Accessing_CVMFS#Environment_variables I agree it makes sense to complete it with a list of GPU types, but then `RSNT_HAS_H100` is way too specific; I'd rather use `RSNT_GPU_TYPES`, which could be e.g. "H100" or "A100" or "H100,MI300A". But unless we go the sysadmin route, we will need a mapping from cluster name to `RSNT_GPU_TYPES` in CVMFS (e.g. CCconfig, which is already meant to be CC-specific); otherwise we'd have way too many heuristics and overhead here for little benefit.
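A rough sketch of that convenience mapping (the cluster entries, GPU types, and the use of `CC_CLUSTER` and Lmod's `setenv()` are assumptions for illustration):

```lua
-- Illustrative only: per-cluster GPU types that CCconfig.lua could use
-- to set RSNT_GPU_TYPES; entries are examples, not real site data.
local cluster_gpu_types = {
    narval    = "A100",
    killarney = "H100",
}

-- Respect a value the site already set; otherwise fall back to the map.
-- setenv() is Lmod's modulefile function.
if not os.getenv("RSNT_GPU_TYPES") then
    local types = cluster_gpu_types[os.getenv("CC_CLUSTER") or ""]
    if types then
        setenv("RSNT_GPU_TYPES", types)
    end
end
```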
`RSNT_GPU_TYPES` makes sense.

I still think we can at least try to detect through `slurm.conf`. It won't be perfect, and it won't cover non-Slurm systems, but it would be better than nothing. Our interconnect detection and CPU vendor detection aren't perfect either (those do vary on different nodes), but we still do them. Defining `RSNT_GPU_TYPES` can be an override to avoid the detection.
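A sketch of how the override could compose with a best-effort, comment-aware `slurm.conf` scan (same assumptions as the earlier sketches; not an implementation from this thread):

```lua
-- Sketch: treat RSNT_GPU_TYPES as an override that short-circuits
-- detection; otherwise collect GPU Gres types from slurm.conf.
local function get_gpu_types()
    local override = os.getenv("RSNT_GPU_TYPES")
    if override and override ~= "" then
        return override  -- e.g. "H100", "A100", or "H100,MI300A"
    end
    local found = {}
    local f = io.open(os.getenv("SLURM_CONF") or "/etc/slurm/slurm.conf", "r")
    if f then
        for line in f:lines() do
            local code = line:gsub("#.*$", "")  -- ignore comments
            for t in code:gmatch("gpu:(%w+)") do
                found[t:upper()] = true
            end
        end
        f:close()
    end
    local types = {}
    for t in pairs(found) do types[#types + 1] = t end
    table.sort(types)
    return table.concat(types, ",")  -- "" if nothing was detected
end
```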