This document explains the generic steps required to upgrade a deployment of the Slurm Appliance with upstream changes from StackHPC. Upstream releases happen roughly monthly and may contain new functionality and/or updated images.

Any site-specific instructions in docs/site/README.md should be reviewed in tandem with this document.
This document assumes the deployment repository has:

- Remotes (a setup sketch follows this list):
    - `origin`, referring to the site-specific remote repository.
    - `stackhpc`, referring to the StackHPC repository at https://github.com/stackhpc/ansible-slurm-appliance.git.
- Branches:
    - `main`, following `origin/main`: the current site-specific code deployed to production.
    - `upstream`, following `stackhpc/main`: i.e. the upstream `main` branch from the `stackhpc` remote.
- The following environments:
    - `$PRODUCTION`: a production environment, as defined by e.g. `environments/production/`.
    - `$STAGING`: a staging environment, as defined by e.g. `environments/staging/`.
    - `$SITE_ENV`: a base site-specific environment, as defined by e.g. `environments/mysite/`.
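If the `stackhpc` remote and `upstream` branch do not already exist, a minimal setup sketch (using the repository URL above and the branch names assumed here) is:

```
git remote add stackhpc https://github.com/stackhpc/ansible-slurm-appliance.git
git fetch stackhpc --tags
git checkout -b upstream stackhpc/main
```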
NB: Commands which should be run on the Slurm login node are shown below prefixed `[LOGIN]$`. All other commands should be run on the Ansible deploy host.
1. Update the `upstream` branch from the `stackhpc` remote, including tags:

    ```
    git fetch stackhpc main --tags
    ```
2. Identify the latest release from the Slurm appliance release page. Below, this release is shown as `vX.Y`.
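    Alternatively, having fetched tags in step 1, the latest release tag can be listed locally (a sketch, assuming release tags are of the form `vX.Y`):

    ```
    git tag --list 'v*' --sort=-v:refname | head -n 1
    ```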
3. Ensure your local site branch is up to date and create a new branch from it for the site-specific release code:

    ```
    git checkout main
    git pull --prune
    git checkout -b update/vX.Y
    ```
4. Merge the upstream code into your release branch:

    ```
    git merge vX.Y
    ```

    It is possible this will introduce merge conflicts; fix these following the usual git prompts. Generally, merge conflicts should only exist where functionality which was added for your site (not in a hook) has subsequently been merged upstream.
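    A minimal sketch of the usual conflict-resolution workflow:

    ```
    git status             # lists conflicted files
    # edit the conflicted files to resolve the conflict markers, then:
    git add <resolved files>
    git merge --continue
    ```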
5. Push this branch and create a PR:

    ```
    git push # follow instructions
    ```
6. Review the PR to see whether any added or changed functionality requires alteration of site-specific configuration. In general, changes to existing functionality aim to be backward compatible; altering site-specific configuration is usually only necessary to use new functionality, or where site functionality has been upstreamed as above. Make changes as necessary.
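    One way to review what changed upstream is to diff the release tags over the configuration paths (a sketch; `vX.PREV` is a placeholder for the previously-deployed release tag, and the paths assume the usual appliance layout):

    ```
    git diff vX.PREV..vX.Y -- ansible/ environments/common/
    ```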
7. Identify the image(s) from the relevant Slurm appliance release, and download them using the link on the release plus the image name, e.g. for an image `openhpc-ofed-RL8-240906-1042-32568dbb`:

    ```
    wget https://object.arcus.openstack.hpc.cam.ac.uk/swift/v1/AUTH_3a06571936a0424bb40bc5c672c4ccb1/openhpc-images/openhpc-ofed-RL8-240906-1042-32568dbb
    ```

    Note that some releases may not include new images. In this case, use the image from the latest previous release which included new images.
8. If required, build an "extra" image with local modifications; see docs/image-build.md.
9. Modify your site-specific environment to use this image, e.g. via `cluster_image_id` in `environments/$SITE_ENV/tofu/variables.tf`.
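    For example, the downloaded image might be uploaded to OpenStack as follows (a sketch; the disk format and image name are assumptions):

    ```
    openstack image create --disk-format qcow2 \
      --file openhpc-ofed-RL8-240906-1042-32568dbb \
      openhpc-ofed-RL8-240906-1042-32568dbb
    ```

    The image ID this returns can then be set as e.g. the default for `cluster_image_id` in `environments/$SITE_ENV/tofu/variables.tf`.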
10. Test this in your staging cluster.
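    The exact commands are site-specific, but a minimal sketch (assuming each environment provides an `activate` script and that the OpenTofu configuration lives in `environments/$STAGING/tofu/`) is:

    ```
    . environments/$STAGING/activate
    cd environments/$STAGING/tofu
    tofu apply        # reimage or recreate instances with the new image
    cd -
    ansible-playbook ansible/site.yml
    ```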
11. Commit changes and push to the PR created above.
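    E.g. (a sketch; the commit message is illustrative):

    ```
    git add environments/$SITE_ENV
    git commit -m "update site configuration for vX.Y"
    git push
    ```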
12. Declare a future outage window to cluster users. A Slurm reservation can be used to prevent jobs running during that window, e.g.:

    ```
    [LOGIN]$ sudo scontrol create reservation Flags=MAINT ReservationName="upgrade-vX.Y" StartTime=2024-10-16T08:00:00 EndTime=2024-10-16T10:00:00 Nodes=ALL Users=root
    ```

    Note that a reservation cannot be created if it may overlap with currently-running jobs (as defined by job or partition time limits).
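    The reservation can be confirmed with:

    ```
    [LOGIN]$ scontrol show reservation
    ```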
13. At the outage window, check there are no jobs running:

    ```
    [LOGIN]$ squeue
    ```
14. Deploy the branch created above to production: activate the production environment, run OpenTofu to reimage or delete/recreate instances with the new images (depending on how the root disk is defined), and run Ansible's `site.yml` playbook to reconfigure the cluster, e.g. as described in the main README.md.
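    As for staging, a minimal sketch of this sequence (same assumptions as in the staging step above):

    ```
    . environments/$PRODUCTION/activate
    cd environments/$PRODUCTION/tofu
    tofu apply        # reimage or delete/recreate instances
    cd -
    ansible-playbook ansible/site.yml
    ```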
15. Check Slurm is up:

    ```
    [LOGIN]$ sinfo -R
    ```

    The `-R` shows the reason for any nodes being down.
16. If the above shows nodes down for having been "unexpectedly rebooted", set them up again:

    ```
    [LOGIN]$ sudo scontrol update state=RESUME nodename=$HOSTLIST_EXPR
    ```

    where the hostlist expression might look like e.g. `general-[0-1]` to reset the state of nodes 0 and 1 in the `general` partition.
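    A hostlist expression can be checked by expanding it into individual node names:

    ```
    [LOGIN]$ scontrol show hostnames general-[0-1]
    ```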
17. Delete the reservation:

    ```
    [LOGIN]$ sudo scontrol delete ReservationName="upgrade-vX.Y"
    ```
18. Tell users the cluster is available again.