
Commit 21aa2fa

simplify metrics to aggregate counts by state

- Remove per-node and per-job metrics to reduce cardinality
- Add user label to job metrics
- slurm_nodes: count by cluster, state
- slurm_jobs: count by cluster, state, user
- Simplify slurm_client to only fetch required fields
- Update README with new metrics and example queries
1 parent 0d0fcfc · commit 21aa2fa
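For context, here is a minimal sketch of how gauges with the labels described above could be defined with the Python `prometheus_client` library. The gauge names and label sets come from the commit message; `publish_counts` is a hypothetical helper for illustration, not code taken from this repository:

```python
from prometheus_client import Gauge

# Aggregate gauges matching the names and labels described in the commit message.
SLURM_NODES = Gauge("slurm_nodes", "Number of nodes by state", ["cluster", "state"])
SLURM_JOBS = Gauge("slurm_jobs", "Number of jobs by state and user", ["cluster", "state", "user"])


def publish_counts(cluster, node_counts, job_counts):
    """Hypothetical helper: push pre-aggregated counts into the gauges.

    node_counts maps state -> count; job_counts maps (state, user) -> count.
    """
    for state, count in node_counts.items():
        SLURM_NODES.labels(cluster=cluster, state=state).set(count)
    for (state, user), count in job_counts.items():
        SLURM_JOBS.labels(cluster=cluster, state=state, user=user).set(count)
```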

File tree: 3 files changed (+48 / -323 lines)


README.md

Lines changed: 24 additions & 58 deletions
````diff
@@ -1,6 +1,6 @@
 # SLURM Prometheus Exporter
 
-A Prometheus exporter for SLURM cluster metrics. Collects node and job information using `scontrol` and `squeue` CLI commands.
+A Prometheus exporter for SLURM cluster metrics. Collects aggregate node and job counts by state using `scontrol` and `squeue` CLI commands.
 
 ## Usage
 
@@ -9,9 +9,10 @@ A Prometheus exporter for SLURM cluster metrics. Collects node and job information
 ```bash
 docker run -d \
   --name slurm-exporter \
-  -p 9341:9341 \
-  -v /etc/slurm:/etc/slurm:ro \
+  --network host \
+  -v /data/slurm:/data/slurm:ro \
   -v /var/run/munge:/var/run/munge:ro \
+  -e SLURM_CONF=/data/slurm/etc/slurm.conf \
   ghcr.io/primeintellect-ai/slurm-exporter:latest \
   --cluster mycluster
 ```
@@ -36,74 +37,42 @@ uv run slurm-exporter --cluster mycluster --port 9341
 
 ## Metrics
 
-### Node Metrics
-
-| Metric | Type | Labels | Description |
-|--------|------|--------|-------------|
-| `slurm_node_state` | Gauge | `cluster`, `node`, `state` | SLURM node state (1 if node is in this state) |
-| `slurm_node_cpus_total` | Gauge | `cluster`, `node` | Total CPUs on the node |
-| `slurm_node_cpus_allocated` | Gauge | `cluster`, `node` | Allocated CPUs on the node |
-| `slurm_node_memory_total_bytes` | Gauge | `cluster`, `node` | Total memory on the node in bytes |
-| `slurm_node_memory_allocated_bytes` | Gauge | `cluster`, `node` | Allocated memory on the node in bytes |
-| `slurm_node_gpus_total` | Gauge | `cluster`, `node`, `gpu_type` | Total GPUs on the node |
-| `slurm_node_gpus_allocated` | Gauge | `cluster`, `node`, `gpu_type` | Allocated GPUs on the node |
-
-### Job Metrics
-
 | Metric | Type | Labels | Description |
 |--------|------|--------|-------------|
-| `slurm_job_state` | Gauge | `cluster`, `job`, `state` | SLURM job state (1 if job is in this state) |
-| `slurm_jobs_total` | Counter | `cluster`, `state` | Total number of jobs by state |
-| `slurm_job_cpus` | Gauge | `cluster`, `job`, `state` | Number of CPUs allocated to the job |
-| `slurm_job_memory_bytes` | Gauge | `cluster`, `job`, `state` | Memory allocated to the job in bytes |
-| `slurm_job_gpus` | Gauge | `cluster`, `job`, `state`, `gpu_type` | Number of GPUs allocated to the job |
-| `slurm_job_nodes` | Gauge | `cluster`, `job`, `state` | Number of nodes allocated to the job |
-
-## Example PromQL Queries
+| `slurm_nodes` | Gauge | `cluster`, `state` | Number of nodes by state |
+| `slurm_jobs` | Gauge | `cluster`, `state`, `user` | Number of jobs by state and user |
 
-### Cluster Utilization
+### Node States
 
-```promql
-# CPU utilization percentage
-sum(slurm_node_cpus_allocated{cluster="mycluster"}) / sum(slurm_node_cpus_total{cluster="mycluster"}) * 100
+Common node states: `idle`, `allocated`, `mixed`, `down`, `drained`, `draining`
 
-# Memory utilization percentage
-sum(slurm_node_memory_allocated_bytes{cluster="mycluster"}) / sum(slurm_node_memory_total_bytes{cluster="mycluster"}) * 100
+### Job States
 
-# GPU utilization percentage
-sum(slurm_node_gpus_allocated) / sum(slurm_node_gpus_total) * 100
+Common job states: `running`, `pending`, `completed`, `failed`, `cancelled`, `timeout`
 
-# GPU utilization by type
-sum by (gpu_type) (slurm_node_gpus_allocated) / sum by (gpu_type) (slurm_node_gpus_total) * 100
-```
-
-### Job Statistics
+## Example PromQL Queries
 
 ```promql
-# Jobs by state
-slurm_jobs_total
+# Total nodes by state
+slurm_nodes
 
-# Running jobs count
-slurm_jobs_total{state="running"}
+# Node utilization (allocated / total)
+sum(slurm_nodes{state="allocated"}) / sum(slurm_nodes) * 100
 
-# Pending jobs count
-slurm_jobs_total{state="pending"}
+# Running jobs
+slurm_jobs{state="running"}
 
-# Total CPUs used by running jobs
-sum(slurm_job_cpus{state="running"})
-```
+# Pending jobs
+slurm_jobs{state="pending"}
 
-### Node Availability
-
-```promql
-# Count nodes by state
-count by (state) (slurm_node_state)
+# Jobs by user
+sum by (user) (slurm_jobs{state="running"})
 
 # Idle nodes
-count(slurm_node_state{state="idle"})
+slurm_nodes{state="idle"}
 
-# Nodes with issues
-count(slurm_node_state{state=~"down|drain|drained"})
+# Problematic nodes (down, drained)
+slurm_nodes{state=~"down|drained|draining"}
 ```
 
 ## Building
@@ -112,9 +81,6 @@ count(slurm_node_state{state=~"down|drain|drained"})
 # Build with Docker Bake
 docker buildx bake
 
-# Build with specific SLURM version
-SLURM_TAG=slurm-24-05-4-1 docker buildx bake
-
 # Push to registry
 docker buildx bake --push
 ```
````
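The commit message also notes that `slurm_client` was simplified to fetch only the fields it needs. The sketch below shows one way such aggregation could be done from `scontrol`/`squeue` output; it assumes `scontrol show nodes -o` (one line per node) and `squeue -h -o '%T %u'` (job state and user), and the function names are hypothetical rather than taken from the repository:

```python
import subprocess
from collections import Counter


def collect_node_counts() -> Counter:
    """Count nodes by state via `scontrol show nodes -o` (one line per node)."""
    out = subprocess.run(
        ["scontrol", "show", "nodes", "-o"],
        capture_output=True, text=True, check=True,
    ).stdout
    states = []
    for line in out.splitlines():
        for token in line.split():
            if token.startswith("State="):
                # e.g. "State=IDLE" or "State=MIXED+DRAIN": keep the base state, lowercased
                states.append(token[len("State="):].split("+")[0].lower())
                break
    return Counter(states)


def collect_job_counts() -> Counter:
    """Count jobs by (state, user) via squeue's parseable output."""
    out = subprocess.run(
        ["squeue", "-h", "-o", "%T %u"],
        capture_output=True, text=True, check=True,
    ).stdout
    pairs = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 2:
            state, user = fields
            pairs.append((state.lower(), user))
    return Counter(pairs)
```

The resulting counters map directly onto the `slurm_nodes` (by state) and `slurm_jobs` (by state and user) label sets shown in the diff above.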
