
Commit 21aa2fa

simplify metrics to aggregate counts by state

- Remove per-node and per-job metrics to reduce cardinality
- Add user label to job metrics
- slurm_nodes: count by cluster, state
- slurm_jobs: count by cluster, state, user
- Simplify slurm_client to only fetch required fields
- Update README with new metrics and example queries
1 parent 0d0fcfc · commit 21aa2fa
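For context, here is a minimal sketch of how gauges with the labels described above could be defined with the Python `prometheus_client` library. The gauge names and label sets come from the commit message; `publish_counts` is a hypothetical helper for illustration, not code taken from this repository:

```python
from prometheus_client import Gauge

# Aggregate gauges matching the names and labels described in the commit message.
SLURM_NODES = Gauge("slurm_nodes", "Number of nodes by state", ["cluster", "state"])
SLURM_JOBS = Gauge("slurm_jobs", "Number of jobs by state and user", ["cluster", "state", "user"])


def publish_counts(cluster, node_counts, job_counts):
    """Hypothetical helper: push pre-aggregated counts into the gauges.

    node_counts maps state -> count; job_counts maps (state, user) -> count.
    """
    for state, count in node_counts.items():
        SLURM_NODES.labels(cluster=cluster, state=state).set(count)
    for (state, user), count in job_counts.items():
        SLURM_JOBS.labels(cluster=cluster, state=state, user=user).set(count)
```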

File tree: 3 files changed (+48 / -323 lines)


README.md

Lines changed: 24 additions & 58 deletions
````diff
@@ -1,6 +1,6 @@
 # SLURM Prometheus Exporter
 
-A Prometheus exporter for SLURM cluster metrics. Collects node and job information using `scontrol` and `squeue` CLI commands.
+A Prometheus exporter for SLURM cluster metrics. Collects aggregate node and job counts by state using `scontrol` and `squeue` CLI commands.
 
 ## Usage
 
@@ -9,9 +9,10 @@ A Prometheus exporter for SLURM cluster metrics. Collects node and job information
 ```bash
 docker run -d \
   --name slurm-exporter \
-  -p 9341:9341 \
-  -v /etc/slurm:/etc/slurm:ro \
+  --network host \
+  -v /data/slurm:/data/slurm:ro \
   -v /var/run/munge:/var/run/munge:ro \
+  -e SLURM_CONF=/data/slurm/etc/slurm.conf \
   ghcr.io/primeintellect-ai/slurm-exporter:latest \
   --cluster mycluster
 ```
@@ -36,74 +37,42 @@ uv run slurm-exporter --cluster mycluster --port 9341
 
 ## Metrics
 
-### Node Metrics
-
-| Metric | Type | Labels | Description |
-|--------|------|--------|-------------|
-| `slurm_node_state` | Gauge | `cluster`, `node`, `state` | SLURM node state (1 if node is in this state) |
-| `slurm_node_cpus_total` | Gauge | `cluster`, `node` | Total CPUs on the node |
-| `slurm_node_cpus_allocated` | Gauge | `cluster`, `node` | Allocated CPUs on the node |
-| `slurm_node_memory_total_bytes` | Gauge | `cluster`, `node` | Total memory on the node in bytes |
-| `slurm_node_memory_allocated_bytes` | Gauge | `cluster`, `node` | Allocated memory on the node in bytes |
-| `slurm_node_gpus_total` | Gauge | `cluster`, `node`, `gpu_type` | Total GPUs on the node |
-| `slurm_node_gpus_allocated` | Gauge | `cluster`, `node`, `gpu_type` | Allocated GPUs on the node |
-
-### Job Metrics
-
 | Metric | Type | Labels | Description |
 |--------|------|--------|-------------|
-| `slurm_job_state` | Gauge | `cluster`, `job`, `state` | SLURM job state (1 if job is in this state) |
-| `slurm_jobs_total` | Counter | `cluster`, `state` | Total number of jobs by state |
-| `slurm_job_cpus` | Gauge | `cluster`, `job`, `state` | Number of CPUs allocated to the job |
-| `slurm_job_memory_bytes` | Gauge | `cluster`, `job`, `state` | Memory allocated to the job in bytes |
-| `slurm_job_gpus` | Gauge | `cluster`, `job`, `state`, `gpu_type` | Number of GPUs allocated to the job |
-| `slurm_job_nodes` | Gauge | `cluster`, `job`, `state` | Number of nodes allocated to the job |
-
-## Example PromQL Queries
+| `slurm_nodes` | Gauge | `cluster`, `state` | Number of nodes by state |
+| `slurm_jobs` | Gauge | `cluster`, `state`, `user` | Number of jobs by state and user |
 
-### Cluster Utilization
+### Node States
 
-```promql
-# CPU utilization percentage
-sum(slurm_node_cpus_allocated{cluster="mycluster"}) / sum(slurm_node_cpus_total{cluster="mycluster"}) * 100
+Common node states: `idle`, `allocated`, `mixed`, `down`, `drained`, `draining`
 
-# Memory utilization percentage
-sum(slurm_node_memory_allocated_bytes{cluster="mycluster"}) / sum(slurm_node_memory_total_bytes{cluster="mycluster"}) * 100
+### Job States
 
-# GPU utilization percentage
-sum(slurm_node_gpus_allocated) / sum(slurm_node_gpus_total) * 100
+Common job states: `running`, `pending`, `completed`, `failed`, `cancelled`, `timeout`
 
-# GPU utilization by type
-sum by (gpu_type) (slurm_node_gpus_allocated) / sum by (gpu_type) (slurm_node_gpus_total) * 100
-```
-
-### Job Statistics
+## Example PromQL Queries
 
 ```promql
-# Jobs by state
-slurm_jobs_total
+# Total nodes by state
+slurm_nodes
 
-# Running jobs count
-slurm_jobs_total{state="running"}
+# Node utilization (allocated / total)
+sum(slurm_nodes{state="allocated"}) / sum(slurm_nodes) * 100
 
-# Pending jobs count
-slurm_jobs_total{state="pending"}
+# Running jobs
+slurm_jobs{state="running"}
 
-# Total CPUs used by running jobs
-sum(slurm_job_cpus{state="running"})
-```
+# Pending jobs
+slurm_jobs{state="pending"}
 
-### Node Availability
-
-```promql
-# Count nodes by state
-count by (state) (slurm_node_state)
+# Jobs by user
+sum by (user) (slurm_jobs{state="running"})
 
 # Idle nodes
-count(slurm_node_state{state="idle"})
+slurm_nodes{state="idle"}
 
-# Nodes with issues
-count(slurm_node_state{state=~"down|drain|drained"})
+# Problematic nodes (down, drained)
+slurm_nodes{state=~"down|drained|draining"}
 ```
 
 ## Building
@@ -112,9 +81,6 @@ count(slurm_node_state{state=~"down|drain|drained"})
 # Build with Docker Bake
 docker buildx bake
 
-# Build with specific SLURM version
-SLURM_TAG=slurm-24-05-4-1 docker buildx bake
-
 # Push to registry
 docker buildx bake --push
 ```
````
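The commit message also notes that `slurm_client` was simplified to fetch only the fields it needs. The sketch below shows one way such aggregation could be done from `scontrol`/`squeue` output; it assumes `scontrol show nodes -o` (one line per node) and `squeue -h -o '%T %u'` (job state and user), and the function names are hypothetical rather than taken from the repository:

```python
import subprocess
from collections import Counter


def collect_node_counts() -> Counter:
    """Count nodes by state via `scontrol show nodes -o` (one line per node)."""
    out = subprocess.run(
        ["scontrol", "show", "nodes", "-o"],
        capture_output=True, text=True, check=True,
    ).stdout
    states = []
    for line in out.splitlines():
        for token in line.split():
            if token.startswith("State="):
                # e.g. "State=IDLE" or "State=MIXED+DRAIN": keep the base state, lowercased
                states.append(token[len("State="):].split("+")[0].lower())
                break
    return Counter(states)


def collect_job_counts() -> Counter:
    """Count jobs by (state, user) via squeue's parseable output."""
    out = subprocess.run(
        ["squeue", "-h", "-o", "%T %u"],
        capture_output=True, text=True, check=True,
    ).stdout
    pairs = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 2:
            state, user = fields
            pairs.append((state.lower(), user))
    return Counter(pairs)
```

The resulting counters map directly onto the `slurm_nodes` (by state) and `slurm_jobs` (by state and user) label sets shown in the diff above.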
