# SLURM Prometheus Exporter

- A Prometheus exporter for SLURM cluster metrics. Collects node and job information using `scontrol` and `squeue` CLI commands.
+ A Prometheus exporter for SLURM cluster metrics. Collects aggregate node and job counts by state using `scontrol` and `squeue` CLI commands.

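For orientation, the counts come from ordinary SLURM CLI calls along these lines (an illustrative sketch only; the exact invocations and flags the exporter uses are not shown in this diff):

```bash
# Illustrative only: commands of the kind the exporter wraps, not its exact invocations.
scontrol show node              # per-node records, including State=...
squeue --noheader -o '%T %u'    # job state and owning user for every job
```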
## Usage

@@ -9,9 +9,10 @@ A Prometheus exporter for SLURM cluster metrics. Collects node and job informati
```bash
docker run -d \
  --name slurm-exporter \
- -p 9341:9341 \
- -v /etc/slurm:/etc/slurm:ro \
+ --network host \
+ -v /data/slurm:/data/slurm:ro \
  -v /var/run/munge:/var/run/munge:ro \
+ -e SLURM_CONF=/data/slurm/etc/slurm.conf \
  ghcr.io/primeintellect-ai/slurm-exporter:latest \
  --cluster mycluster
```
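Once the container is running, a quick sanity check is to curl the metrics endpoint (assuming the standard Prometheus `/metrics` path and the default port 9341 that appears elsewhere in this README):

```bash
# Assumes /metrics and port 9341; adjust if the exporter is configured differently.
curl -s http://localhost:9341/metrics | grep '^slurm_'
```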
@@ -36,74 +37,42 @@ uv run slurm-exporter --cluster mycluster --port 9341

## Metrics

- ### Node Metrics
-
- | Metric | Type | Labels | Description |
- |--------|------|--------|-------------|
- | `slurm_node_state` | Gauge | `cluster`, `node`, `state` | SLURM node state (1 if node is in this state) |
- | `slurm_node_cpus_total` | Gauge | `cluster`, `node` | Total CPUs on the node |
- | `slurm_node_cpus_allocated` | Gauge | `cluster`, `node` | Allocated CPUs on the node |
- | `slurm_node_memory_total_bytes` | Gauge | `cluster`, `node` | Total memory on the node in bytes |
- | `slurm_node_memory_allocated_bytes` | Gauge | `cluster`, `node` | Allocated memory on the node in bytes |
- | `slurm_node_gpus_total` | Gauge | `cluster`, `node`, `gpu_type` | Total GPUs on the node |
- | `slurm_node_gpus_allocated` | Gauge | `cluster`, `node`, `gpu_type` | Allocated GPUs on the node |
-
- ### Job Metrics
-
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
- | `slurm_job_state` | Gauge | `cluster`, `job`, `state` | SLURM job state (1 if job is in this state) |
- | `slurm_jobs_total` | Counter | `cluster`, `state` | Total number of jobs by state |
- | `slurm_job_cpus` | Gauge | `cluster`, `job`, `state` | Number of CPUs allocated to the job |
- | `slurm_job_memory_bytes` | Gauge | `cluster`, `job`, `state` | Memory allocated to the job in bytes |
- | `slurm_job_gpus` | Gauge | `cluster`, `job`, `state`, `gpu_type` | Number of GPUs allocated to the job |
- | `slurm_job_nodes` | Gauge | `cluster`, `job`, `state` | Number of nodes allocated to the job |
-
- ## Example PromQL Queries
+ | `slurm_nodes` | Gauge | `cluster`, `state` | Number of nodes by state |
+ | `slurm_jobs` | Gauge | `cluster`, `state`, `user` | Number of jobs by state and user |

- ### Cluster Utilization
+ ### Node States

- ```promql
- # CPU utilization percentage
- sum(slurm_node_cpus_allocated{cluster="mycluster"}) / sum(slurm_node_cpus_total{cluster="mycluster"}) * 100
+ Common node states: `idle`, `allocated`, `mixed`, `down`, `drained`, `draining`

- # Memory utilization percentage
- sum(slurm_node_memory_allocated_bytes{cluster="mycluster"}) / sum(slurm_node_memory_total_bytes{cluster="mycluster"}) * 100
+ ### Job States

- # GPU utilization percentage
- sum(slurm_node_gpus_allocated) / sum(slurm_node_gpus_total) * 100
+ Common job states: `running`, `pending`, `completed`, `failed`, `cancelled`, `timeout`

- # GPU utilization by type
- sum by (gpu_type) (slurm_node_gpus_allocated) / sum by (gpu_type) (slurm_node_gpus_total) * 100
- ```
-
- ### Job Statistics
+ ## Example PromQL Queries

```promql
- # Jobs by state
- slurm_jobs_total
+ # Total nodes by state
+ slurm_nodes

- # Running jobs count
- slurm_jobs_total{state="running"}
+ # Node utilization (allocated / total)
+ sum(slurm_nodes{state="allocated"}) / sum(slurm_nodes) * 100

- # Pending jobs count
- slurm_jobs_total{state="pending"}
+ # Running jobs
+ slurm_jobs{state="running"}

- # Total CPUs used by running jobs
- sum(slurm_job_cpus{state="running"})
- ```
+ # Pending jobs
+ slurm_jobs{state="pending"}

- ### Node Availability
-
- ```promql
- # Count nodes by state
- count by (state) (slurm_node_state)
+ # Jobs by user
+ sum by (user) (slurm_jobs{state="running"})

# Idle nodes
- count(slurm_node_state{state="idle"})
+ slurm_nodes{state="idle"}

- # Nodes with issues
- count(slurm_node_state{state=~"down|drain|drained"})
+ # Problematic nodes (down, drained, draining)
+ slurm_nodes{state=~"down|drained|draining"}
```

## Building
@@ -112,9 +81,6 @@ count(slurm_node_state{state=~"down|drain|drained"})
# Build with Docker Bake
docker buildx bake

- # Build with specific SLURM version
- SLURM_TAG=slurm-24-05-4-1 docker buildx bake
-
# Push to registry
docker buildx bake --push
```