Commit 399bac8

Merge pull request #746 from tdthatcher/ijobs-nodenames
Slurm/interactive_jobs.md updates
2 parents 4f5cf31 + b5e3956 commit 399bac8

1 file changed: docs/Documentation/Slurm/interactive_jobs.md (+51, -40)
@@ -25,18 +25,18 @@ salloc: job 512998 queued and waiting for resources
 salloc: job 512998 has been allocated resources
 salloc: Granted job allocation 512998
 salloc: Waiting for resource configuration
-salloc: Nodes r2i2n5,r2i2n6 are ready for job
-[hpc_user@r2i2n5 ~]$
+salloc: Nodes x1008c7s6b1n0,x1008c7s6b1n1 are ready for job
+[hpc_user@x1008c7s6b1n0 ~]$
 ```
 
 You can view the nodes that are assigned to your interactive jobs using one of these methods:
 
 ```
 $ echo $SLURM_NODELIST
-r2i2n[5-6]
+x1008c7s6b1n[0-1]
 $ scontrol show hostname
-r2i2n5
-r2i2n6
+x1008c7s6b1n0
+x1008c7s6b1n1
 ```
 
 Once a job is allocated, you will automatically "ssh" to the first allocated node so you do not need to manually ssh to the node after it is assigned. If you requested more than one node, you may ssh to any of the additional nodes assigned to your job.
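As an illustrative aside, not part of the diff itself: with the allocation shown above, reaching one of the other assigned nodes is just an ssh from the first node, using any hostname reported by `scontrol show hostname`. A minimal sketch using the example node names from this commit:

```
# From the first allocated node (x1008c7s6b1n0 in the example above),
# connect to the second node of the same allocation.
[hpc_user@x1008c7s6b1n0 ~]$ ssh x1008c7s6b1n1
```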
@@ -50,13 +50,17 @@ Type `exit` when finished using the node.
 
 Interactive jobs are useful for many tasks. For example, to debug a job script, users may submit a request to get a set of nodes for interactive use. When the job starts, the user "lands" on a compute node, with a shell prompt. Users may then run the script to be debugged many times without having to wait in the queue multiple times.
 
-A debug job allows up to two nodes to be available with shorter wait times when the system is heavily utilized. This is accomplished by limiting the number of nodes to 2 per job allocation and specifying `--partition=debug`. For example:
+A debug job allows up to two nodes to be available with shorter wait times when the system is heavily utilized. This is accomplished by specifying `--partition=debug`. For example:
 
 ```
-[hpc_user@el1 ~]$ salloc --time=60 --accounft=<handle> --nodes=2 --partition=debug
+[hpc_user@kl1 ~]$ salloc --time=60 --account=<handle> --partition=debug
 ```
 
-A debug node will only be available for a maximum wall time of 1 hour.
+Add `--nodes=2` to claim two nodes.
+
+Add `--gpus=#` (substituting the number of GPUs you want to use) to claim a debug GPU node. Note that there are fewer GPU nodes in the debug queue, so there may be more of a wait time.
+
+A debug job on any node type will only be available for jobs with a maximum walltime (--time) of 1 hour, and only one debug job at a time is permitted per person.
 
 ## Sample Interactive Job Commands
 
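As an illustrative aside, not part of the diff itself: the flags described in the debug-job text above can be combined, so a two-node debug request or a debug GPU request might look like the following sketch (the account handle and GPU count are placeholders):

```
# Two nodes in the debug partition for up to one hour.
[hpc_user@kl1 ~]$ salloc --time=60 --account=<handle> --partition=debug --nodes=2

# A debug GPU node, requesting 2 GPUs.
[hpc_user@kl1 ~]$ salloc --time=60 --account=<handle> --partition=debug --gpus=2
```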
@@ -75,8 +79,8 @@ $ salloc --time=20 --account=<handle> --nodes=2
 The above salloc command will log you into one of the two nodes automatically. You can then launch your software using an srun command with the appropriate flags, such as --ntasks or --ntasks-per-node:
 
 ```
-[hpc_user@r2i2n5 ~]$ module purge; module load paraview
-[hpc_user@r2i2n5 ~]$ srun --ntasks=20 --ntasks-per-node=10 pvserver --force-offscreen-rendering
+[hpc_user@x1008c7s6b1n0 ~]$ module purge; module load paraview
+[hpc_user@x1008c7s6b1n0 ~]$ srun --ntasks=20 --ntasks-per-node=10 pvserver --force-offscreen-rendering
 ```
 
 If your single-node job needs a GUI that uses X-windows:
@@ -92,50 +96,57 @@ If your multi-node job needs a GUI that uses X-windows, the least fragile mechan
 ```
 $ salloc --time=20 --account=<handle> --nodes=2
 ...
-[hpc_user@r3i5n13 ~]$ (your compute node r3i5n13)
+[hpc_user@x1008c7s6b1n0 ~]$ (your compute node x1008c7s6b1n0)
 ```
 
 Then from your local workstation:
 
 ```
 $ ssh -Y kestrel.hpc.nrel.gov
 ...
-[hpc_user@el1 ~]$ ssh -Y r3i5n13 #(from login node to reserved compute node)
+[hpc_user@kl1 ~]$ ssh -Y x1008c7s6b1n0 #(from login node to reserved compute node)
 ...
-[hpc_user@r3i5n13 ~]$ #(your compute node r3i5n13, now X11-capable)
-[hpc_user@r3i5n13 ~]$ xterm #(or another X11 GUI application)
+[hpc_user@x1008c7s6b1n0 ~]$ #(your compute node x1008c7s6b1n0, now X11-capable)
+[hpc_user@x1008c7s6b1n0 ~]$ xterm #(or another X11 GUI application)
 ```
 
-## Requesting Interactive GPU Nodes
+From a Kestrel-DAV FastX remote desktop session, you can omit the `ssh -Y kestrel.hpc.nrel.gov` above since your terminal in FastX will already be connected to a DAV (kd#) login node.
 
-The following command requests interactive access to GPU nodes:
 
-```
-[hpc_user@el2 ~] $ salloc --account=<handle> --time=5 --gres=gpu:2
-```
+## Requesting Interactive GPU Nodes
 
-This next srun command inside the interactive session gives you access to the GPU devices:
+The following command requests interactive access to GPU nodes:
 
 ```
-[hpc_user@r104u33 ~] $ srun --gres=gpu:2 nvidia-smi
-Mon Oct 21 09:03:29 2019
-+-------------------------------------------------------------------+
-| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
-|---------------------+----------------------+----------------------+
-| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
-| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
-|=====================+======================+======================|
-| 0 Tesla H100-PCIE... Off | 00000000:37:00.0 Off | 0 |
-| N/A 41C P0 38W / 250W | 0MiB / 16130MiB | 0% Default |
-+---------------------+----------------------+----------------------+
-| 1 Tesla H100-PCIE... Off | 00000000:86:00.0 Off | 0 |
-| N/A 40C P0 36W / 250W | 0MiB / 16130MiB | 0% Default |
-+---------------------+----------------------+----------------------+
+[hpc_user@kl2 ~] $ salloc --account=<handle> --time=5 --gpus=2
+```
+You may run the nvidia-smi command to confirm the GPUs are visible:
+
+```
+[hpc_user@x3100c0s29b0n0 ~] $ nvidia-smi
+Wed Mar 12 16:20:53 2025
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
+|-----------------------------------------+------------------------+----------------------+
+| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
+| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
+| | | MIG M. |
+|=========================================+========================+======================|
+| 0 NVIDIA H100 80GB HBM3 On | 00000000:04:00.0 Off | 0 |
+| N/A 40C P0 71W / 699W | 0MiB / 81559MiB | 0% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 1 NVIDIA H100 80GB HBM3 On | 00000000:64:00.0 Off | 0 |
+| N/A 40C P0 73W / 699W | 0MiB / 81559MiB | 0% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes: |
+| GPU GI CI PID Type Process name GPU Memory |
+| ID ID Usage |
+|=========================================================================================|
+| No running processes found |
++-----------------------------------------------------------------------------------------+
 
-+-------------------------------------------------------------------+
-| Processes: GPU Memory |
-| GPU PID Type Process name Usage |
-|===================================================================|
-| No running processes found |
-+-------------------------------------------------------------------+
 ```
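As an illustrative aside, not part of the diff itself: inside the interactive GPU session, parallel job steps can be launched on the allocated GPUs with srun, much as the removed example did with `--gres=gpu:2`. A minimal sketch using the newer `--gpus` flag shown above:

```
# Launch a job step across the two allocated GPUs from inside the
# interactive session; `nvidia-smi -L` simply lists the visible GPUs.
[hpc_user@x3100c0s29b0n0 ~]$ srun --gpus=2 nvidia-smi -L
```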
