Introduce get_has_h100() function. #107

Open · wants to merge 1 commit into base: main
lmod/SitePackage.lua: 21 additions, 0 deletions

@@ -192,6 +192,27 @@ function _get_cpu_vendor_id()
return "unknown"
end
end
function get_has_h100()
    local has_h100 = os.getenv("RSNT_HAS_H100") or ""
    if has_h100 and has_h100 ~= "" then
        return has_h100
    end
    has_h100 = "false"
    local cluster = os.getenv("CC_CLUSTER") or ""
    local modulercfile = "/cvmfs/soft.computecanada.ca/config/lmod/modulerc_" .. cluster
    local f = io.open(modulercfile, "r")
    if f ~= nil then
        for line in f:lines() do
            if line == "module-version cuda/12.6 default" then

Review thread on the line above:

@mboisson (Member), Jul 3, 2025:
This will break in future versions of CUDA (when they become the default).

@mboisson (Member), Jul 3, 2025:
How about something like cat /etc/slurm/slurm.conf | grep gpu:h100?

Contributor (author):
I'm not sure I understand. If we bump the CUDA default on H100 systems, we should adapt this line too?

Member:
Killarney's slurm.conf is in $SLURM_CONF

Contributor (author):
(But why would we bump the default again before StdEnv/2026?)

@mboisson (Member), Jul 3, 2025:
I would very much like to avoid creating those modulerc_<cluster> files and instead rely on system information, i.e. use the modulerc_<cluster> file only as a last resort.

Your point about the scheduler being down is fair, though. I guess we can go back to my idea of reading the slurm.conf (though more smartly than just grepping without consideration for comments).
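
As a minimal sketch of that idea (not code from this PR; the helper name _slurm_conf_has_h100, the fallback path and the simple substring match are all assumptions), such a check might look like:

-- Hypothetical helper, not part of this PR: best-effort H100 detection from
-- slurm.conf. Honours $SLURM_CONF and skips "#" comments before matching.
function _slurm_conf_has_h100()
    local conf = os.getenv("SLURM_CONF") or "/etc/slurm/slurm.conf"
    local f = io.open(conf, "r")
    if f == nil then
        return false
    end
    local found = false
    for line in f:lines() do
        line = line:gsub("#.*$", "")   -- drop comments so they cannot match
        if line:lower():find("gpu:h100", 1, true) then
            found = true
        end
    end
    f:close()
    return found
end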

Contributor (author):
I think parsing slurm.conf can still be too error-prone.

Let's go back to basics. What we have here is a bunch of clusters going online around 2025 which need a newer module default and some hiding; it's a bit broader than just the H100 thing. We had the same thing happen with Narval in 2021. So instead of focusing on H100, we could focus on "2025" and introduce an environment variable, RSNT_CLUSTER_YEAR, which we can set in CCconfig.lua. Then have a configuration file with a simple table like this:

fir 2025
rorqual 2025
tamia 2025
...
narval 2021
beluga 2018
niagara 2018
graham 2016
cedar 2016

Then StdEnv/2023 can add modulerc_2025, StdEnv/2020 can add modulerc_2021, etc. The 2020/2021 entries look a bit superfluous, but they are there mostly to illustrate that this could be a little more future-proof, if it's also past-proof.
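
Purely as an illustration of that proposal (cluster_years and get_cluster_year are made-up names, the table entries mirror the list above, and setenv is used the same way as in the diff), the lookup could be as small as:

-- Hypothetical sketch of the RSNT_CLUSTER_YEAR idea; not an agreed API.
local cluster_years = {
    fir = "2025", rorqual = "2025", tamia = "2025",
    narval = "2021",
    beluga = "2018", niagara = "2018",
    graham = "2016", cedar = "2016",
}
function get_cluster_year()
    local year = os.getenv("RSNT_CLUSTER_YEAR") or ""
    if year == "" then
        year = cluster_years[os.getenv("CC_CLUSTER") or ""] or "unknown"
        setenv("RSNT_CLUSTER_YEAR", year)
    end
    return year
end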

Member:
I don't like hard-coding names of clusters. I try to think more generally than just the Alliance. There are systems out there that are not managed by us, and it would be nice to support them out of the box. I would rather rely on features we can detect, if at all possible.

Contributor (author):
The problem is that detecting GPUs from a login node remains tricky, even more so for systems we don't even have access to, which may not even run Slurm; for those we should just tell the sysadmins to set an appropriate environment variable in /etc/environment or similar. In the end we map from cluster names ourselves as a convenience, to save ourselves the trouble of asking sysadmins at every site to set those...

Regarding the list at https://docs.alliancecan.ca/wiki/Accessing_CVMFS#Environment_variables, I agree it makes sense to complete it with a list of GPU types, but RSNT_HAS_H100 is way too specific; I'd rather use RSNT_GPU_TYPES, which could be e.g. "H100" or "A100" or "H100,MI300A". But unless we go the sysadmin route we will need a mapping from cluster name to RSNT_GPU_TYPES in CVMFS (e.g. CCconfig, which is already meant to be CC-specific); otherwise we'd have way too many heuristics and overhead here for little benefit.
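
Since RSNT_GPU_TYPES would hold a comma-separated list such as "H100,MI300A", consumers would need to test membership rather than equality. A minimal illustrative helper (the name _has_gpu_type and its behaviour are assumptions, not something proposed in this thread):

-- Hypothetical helper: is a given model listed in RSNT_GPU_TYPES?
function _has_gpu_type(wanted)
    local types = os.getenv("RSNT_GPU_TYPES") or ""
    for entry in types:gmatch("[^,]+") do
        if entry:upper() == wanted:upper() then
            return true
        end
    end
    return false
end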

Member:
RSNT_GPU_TYPES makes sense.

I still think we can at least try to detect through slurm.conf. It won't be perfect, and it won't cover non-Slurm systems, but it would be better than nothing. Our interconnect detection and CPU vendor detection aren't perfect either (those do vary on different nodes), but we still do them. Defining RSNT_GPU_TYPES can be an override to avoid the detection.
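
Putting the thread together, a rough illustrative sketch of that direction, with RSNT_GPU_TYPES as the explicit override and slurm.conf detection as the fallback (get_gpu_types is an assumed name, _slurm_conf_has_h100 refers to the sketch earlier in this thread, and only the H100 case is shown):

-- Hypothetical sketch: override via RSNT_GPU_TYPES, detection as fallback.
function get_gpu_types()
    local gpu_types = os.getenv("RSNT_GPU_TYPES") or ""
    if gpu_types == "" and _slurm_conf_has_h100() then
        gpu_types = "H100"   -- extend with other detected models as needed
    end
    setenv("RSNT_GPU_TYPES", gpu_types)
    return gpu_types
end
sandbox_registration{ get_gpu_types = get_gpu_types }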

has_h100 = "true"
end
end
f:close()
end
setenv("RSNT_HAS_H100", has_h100)
return has_h100
end
sandbox_registration{ get_has_h100 = get_has_h100 }
function get_interconnect()
    local posix = require "posix"
    if posix.stat("/sys/module/opa_vnic","type") == 'directory' then
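
For context on how the new sandboxed function could be consumed, here is a hypothetical modulefile snippet (not part of this diff; it only assumes Lmod's standard LmodMessage function being available in modulefiles):

-- Hypothetical modulefile snippet, not part of this PR: warn when a module
-- is tuned for H100-class GPUs but the cluster does not appear to have them.
if get_has_h100() ~= "true" then
    LmodMessage("Note: this software is tuned for H100 GPUs, which this cluster does not appear to have.")
end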