Introduce get_has_h100() function. #107
Conversation
Information is cached in the RSNT_HAS_H100 environment variable
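For context, the caching pattern the description refers to typically looks something like the sketch below; compute_has_h100() is a hypothetical stand-in for whatever detection logic this PR settles on, and this is not the PR's actual implementation.

-- Hypothetical illustration of env-var caching: trust RSNT_HAS_H100 when a
-- sysadmin (or an earlier call) has already set it, otherwise fall back to
-- detection. compute_has_h100() is a stand-in, not part of this PR.
local function get_has_h100()
    local cached = os.getenv("RSNT_HAS_H100")
    if cached ~= nil and cached ~= "" then
        return cached == "yes"
    end
    return compute_has_h100()
end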
local f = io.open(modulercfile, "r")
if f ~= nil then
    for line in f:lines() do
        if line == "module-version cuda/12.6 default" then
This will break in future versions of CUDA (when they become default).
How about something like cat /etc/slurm/slurm.conf | grep gpu:h100 ?
I'm not sure I understand. If we bump the CUDA default on H100 systems, we should adapt this line too?
Killarney's slurm.conf is in $SLURM_CONF.
(But why would we bump the default again before StdEnv/2026?)
I would very much like to avoid creating those modulerc_<cluster> files and instead rely on system information, i.e., use the modulerc_<cluster> only as a last resort.
Your point about the scheduler being down is fair though. I guess we can go back to my idea of reading the slurm.conf (though more smartly than just grepping without any consideration for comments).
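For illustration, a comment-aware scan of slurm.conf could look roughly like the sketch below; the function name, the fallback path, and the Gres pattern are assumptions, not anything already in this PR.

-- Hypothetical sketch: scan slurm.conf for GPU GRES entries, ignoring comments.
-- Assumes SLURM_CONF points at the file, falling back to /etc/slurm/slurm.conf.
local function detect_gpu_types_from_slurm_conf()
    local path = os.getenv("SLURM_CONF") or "/etc/slurm/slurm.conf"
    local f = io.open(path, "r")
    if f == nil then return nil end
    local types = {}
    for line in f:lines() do
        line = line:gsub("#.*$", "")              -- drop trailing comments
        -- match Gres=gpu:<type>[:count] entries, e.g. Gres=gpu:h100:4
        for gpu_type in line:gmatch("[Gg]res=gpu:([%w_]+)") do
            if not gpu_type:match("^%d+$") then   -- skip Gres=gpu:4 (count only)
                types[gpu_type:lower()] = true
            end
        end
    end
    f:close()
    local result = {}
    for t in pairs(types) do table.insert(result, t) end
    return result
end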
I think parsing slurm.conf can still be too error-prone.
Let's go back to basics. What we have here is a bunch of clusters going online around 2025, which need a newer module default and some hiding; it's a bit broader than just the H100 thing. We had the same thing happen with Narval in 2021.
So instead of focusing on H100, we could instead focus on "2025" and introduce an environment variable RSNT_CLUSTER_YEAR, which we can set in CCconfig.lua. Then have a configuration file with a simple table like this:
fir 2025
rorqual 2025
tamia 2025
...
narval 2021
beluga 2018
niagara 2018
graham 2016
cedar 2016
Then StdEnv/2023 can add modulerc_2025, StdEnv/2020 can add modulerc_2021, etc. The 2020/2021 stuff looks a bit superfluous but is here mostly to illustrate that this could be a little more future-proof, if it's also past-proof.
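A minimal sketch of how such a table could be consumed, assuming a plain "<cluster> <year>" file and a hypothetical helper name (neither is defined in this PR):

-- Hypothetical: read a cluster-year table like the one above, one
-- "<cluster> <year>" pair per line, and return the year for this cluster.
local function get_cluster_year(tablefile, cluster)
    local f = io.open(tablefile, "r")
    if f == nil then return nil end
    local year = nil
    for line in f:lines() do
        local name, y = line:match("^%s*(%S+)%s+(%d+)%s*$")
        if name == cluster then
            year = y
            break
        end
    end
    f:close()
    return year
end
-- e.g. CCconfig.lua could export get_cluster_year(yearfile, clustername) as
-- RSNT_CLUSTER_YEAR, and StdEnv/2023 would then pick up modulerc_2025.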
I don't like hard-coding names of clusters. I try to think more generally than just the Alliance. There are systems out there that are not managed by us, and it would be nice to support them out of the box. I would rather rely on features that we can detect, if at all possible.
The problem is that detecting GPUs from a login node remains tricky, even more so for systems we don't even have access to, which may not even run Slurm; in that case we should just tell them to set an appropriate environment variable in /etc/environment or similar. In the end we map from cluster names ourselves as a convenience, to save ourselves the trouble of asking sysadmins at every site to set those...
Within the list at https://docs.alliancecan.ca/wiki/Accessing_CVMFS#Environment_variables I agree it makes sense to complete it with a list of GPU types, but then RSNT_HAS_H100 is way too specific; I'd rather use RSNT_GPU_TYPES, which could be e.g. "H100" or "A100" or "H100,MI300A". But unless we go the sysadmin route, we will need a mapping from cluster name to RSNT_GPU_TYPES in CVMFS (e.g. CCconfig, which is already meant to be CC-specific); otherwise we'd have way too many heuristics and overhead here for little benefit.
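A minimal sketch of such a mapping, assuming the cluster name comes from CC_CLUSTER; the entries shown are examples based on this thread, not a definitive list:

local function gpu_types_for_cluster(cluster)
    -- Example entries only; the real table would be maintained in CCconfig.
    local gpu_types_by_cluster = {
        fir     = "H100",
        rorqual = "H100",
        tamia   = "H100",
        narval  = "A100",
    }
    return gpu_types_by_cluster[cluster]
end
-- e.g. gpu_types_for_cluster(os.getenv("CC_CLUSTER"))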
RSNT_GPU_TYPES makes sense.
I still think we can at least try to detect through slurm.conf. It won't be perfect, and it won't cover non-Slurm systems, but it would be better than nothing. Our interconnect detection and CPU vendor detection aren't perfect either (those do vary on different nodes), but we still do them. Defining RSNT_GPU_TYPES can be an override to avoid the detection.
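A hedged sketch of that precedence, reusing the hypothetical slurm.conf helper sketched earlier in the thread; none of these names exist in the PR as-is:

-- Explicitly defined RSNT_GPU_TYPES wins; otherwise fall back to detection.
-- detect_gpu_types_from_slurm_conf() is the hypothetical helper sketched above.
local function get_gpu_types()
    local override = os.getenv("RSNT_GPU_TYPES")
    if override ~= nil and override ~= "" then
        return override                        -- sysadmin-provided override
    end
    local detected = detect_gpu_types_from_slurm_conf()
    if detected ~= nil and #detected > 0 then
        return table.concat(detected, ",")     -- e.g. "h100" or "h100,mi300a"
    end
    return nil                                 -- unknown: keep current behaviour
end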