-
Notifications
You must be signed in to change notification settings - Fork 11
Update slurm play for nodes #116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
WalkthroughThis update revises Slurmd configuration files, updates host inventory and templates, introduces a new workstation config, and modifies Ansible tasks for role-based config deployment. The Slurm version is upgraded to 24.05.1, and the systemd service is tuned for process priority. Host groupings and attributes are also reorganized. Changes
Sequence Diagram(s)sequenceDiagram
participant Ansible
participant Host
participant Systemd
Ansible->>Host: Detect group (workstations/computenodes)
Ansible->>Host: Copy respective slurmd config file
Ansible->>Host: Install dependencies (incl. libncurses6)
Ansible->>Host: Update slurmd.service ExecStart (nice -n 19)
Ansible->>Systemd: Restart slurmd service
Host->>Systemd: slurmd starts with new config and priority
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
roles/common/tasks/main.yml (1)
18-18: Gate the Slurm role to the relevant host groupsImporting
slurm.ymlunconditionally will run Slurm setup on every host that receives thecommonrole—workstations, user-servers, but also nodes where Slurm is not expected (e.g. bastion, storage).
Add awhen:guard (orapplythe role only tocomputenodes,workstations,userservers) to avoid needless package installs and long playbook runs.- {import_tasks: slurm.yml, tags: slurm} + {import_tasks: slurm.yml, tags: slurm, when: "'slurm' in group_names"}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
files/default.slurmd.computenode(1 hunks)files/default.slurmd.workstation(1 hunks)inventory(2 hunks)roles/common/tasks/main.yml(1 hunks)roles/common/tasks/slurm.yml(3 hunks)roles/common/templates/hosts.j2(1 hunks)vars/slurm.yml(1 hunks)
🔇 Additional comments (13)
roles/common/templates/hosts.j2 (1)
30-36: Ensure all aliases & IPMI rows reflect the new hostnameYou replaced
cicus04withdnpus01on the main row, but the IPMI/-mgmt entry on line 36 still referencescicus04-mgmt. Verify that:
- The management address is correct for
dnpus01.- There are no stale
cicus04aliases elsewhere that could confuse name resolution.A mismatched
/etc/hostspair can break tools that rely on deterministic forward/reverse look-ups.vars/slurm.yml (1)
1-1: Confirm compatibility before jumping to Slurm 24.05.124.x introduces cgroup-v2 & burst-buffer changes. Make sure:
• Munge, plugins and accounting DB are upgraded in lock-step.
• Config syntax (e.g.SchedulerParameters) didn’t deprecate options you still set elsewhere.If you haven’t run a dry-run cluster upgrade, do so before merging.
files/default.slurmd.computenode (1)
11-11: Verify the--confstring syntaxSlurmd expects comma-separated key-value pairs (no spaces).
'Feature=computenode Weight=1'might be parsed as two arguments and fail to override node attributes.-SLURMD_OPTIONS="-Z --conf-server dnpus01.douglas.rtss.qc.ca --conf 'Feature=computenode Weight=1'" +SLURMD_OPTIONS="-Z --conf-server dnpus01.douglas.rtss.qc.ca --conf 'Feature=computenode,Weight=1'"Test with
slurmd -Con a node before rolling cluster-wide.files/default.slurmd.workstation (1)
11-11: Possible parsing issue in workstation overrideSame spacing problem as the compute config:
-SLURMD_OPTIONS="-Z --conf-server dnpus01.douglas.rtss.qc.ca --conf 'Gres=gpu:geforce Feature=workstation Weight=50'" +SLURMD_OPTIONS="-Z --conf-server dnpus01.douglas.rtss.qc.ca --conf 'Gres=gpu:geforce,Feature=workstation,Weight=50'"Without the commas, Slurmd may ignore everything after the first space. Confirm on a test workstation.
inventory (4)
22-22: New workstation added.Addition looks correct and follows naming convention.
69-70: Host additions and reorganization.The additions of
dnpws26and repositioning ofcichm01are consistent with the inventory structure.
73-77: New compute nodes added.The addition of
ciccs01throughciccs05follows the existing naming pattern and aligns with the Slurm configuration updates.
84-86: New userservers group with NIS roles.The userservers group with master/slave NIS roles is well-structured and aligns with the updated Slurm configuration that references these servers.
roles/common/tasks/slurm.yml (5)
23-23: Added libncurses6 dependency.Good addition to ensure all required dependencies are installed.
40-40: Added privilege escalation for file check.Correct addition since checking system files may require elevated privileges.
107-119: Conditional configuration deployment by node type.Excellent implementation of role-based configuration. The conditional copying based on group membership ensures appropriate slurmd configuration for workstations and compute nodes.
124-130: Added nice value tuning for slurmd process.The nice value of 19 sets lowest priority for slurmd, which may impact daemon responsiveness. Verify this is the intended behavior for your workload requirements.
134-135: Added execstart tag for selective execution.Good practice to enable targeted task execution during deployment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did the (encrypted) munge key get removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it wasn't removed, but it's different from the one that's in the playbook (I did a diff on the munge.keys = and saw that they were different when i was refactoring this PR since i had to remove a bunch of stuff).
| - name: enable slurmd | ||
| systemd: state=restarted enabled=true name=slurmd daemon_reload=yes | ||
| tags: | ||
| - execstart |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better tag here with slurm?
|
Take a look at https://github.com/ottok/debcraft I was able to very easily build deb files which we can use to make our own repo, just download, unpack and it does the rest (uses docker) Just need to figure out which deb files we need for our setup since it builds lots of stuff. |
cool will do |
Summary by CodeRabbit
New Features
Bug Fixes
Chores