Host metrics #293

Draft · wants to merge 9 commits into monitoring
Conversation

mwiencek (Member)

This is based on #291.

The service definition was based on https://github.com/metabrainz/prometheus-exp/blob/main/node.sh
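
For reference, a minimal sketch of what such a service definition can look like in Compose, following the Docker example in the upstream node_exporter README (the image tag, flags, and mounts here are illustrative, not copied from node.sh):

```yaml
services:
  node-exporter:
    image: quay.io/prometheus/node-exporter:latest
    # share the host's network and PID namespaces so host metrics are visible
    network_mode: host
    pid: host
    restart: unless-stopped
    command:
      # read host metrics from the bind-mounted root filesystem below
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'
```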

reosarevok and others added 9 commits on January 28, 2025 at 22:10:
This is based on https://grafana.com/grafana/dashboards/14114-postgres-overview/
with an extra check for max query duration that seemed interesting,
and is mostly intended as a proof of concept for provisioning dashboards.
We can further improve the dashboard as needed.
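
As a sketch of the provisioning mechanism this is a proof of concept for: Grafana picks up dashboard JSON files through a file provider config like the following (the provider name and paths are illustrative):

```yaml
# e.g. shipped to /etc/grafana/provisioning/dashboards/ (path illustrative)
apiVersion: 1
providers:
  - name: 'default'
    type: file
    options:
      # directory Grafana watches for dashboard JSON files
      path: /var/lib/grafana/dashboards
```
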
As a start, monitor the number of rows in sir-indexed tables.
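
A minimal sketch of how such a collector could be defined, assuming an sql_exporter-style collector file and PostgreSQL's n_live_tup statistic (an estimate, much cheaper than an exact count(*)); the collector and metric names are hypothetical:

```yaml
collector_name: sir_table_rows        # hypothetical name
metrics:
  - metric_name: sir_table_row_count  # hypothetical name
    type: gauge
    help: 'Estimated number of rows in sir-indexed tables.'
    key_labels: [relname]             # one series per table
    values: [n_live_tup]
    query: |
      SELECT relname, n_live_tup
      FROM pg_stat_user_tables
      -- (restricting to sir-indexed tables is omitted in this sketch)
```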

Includes a dashboard with gauges for every table. I don't see why line charts would be useful here: we shouldn't expect sudden jumps, and the point is simply to see at a glance which tables are bigger.

There seems to be no good reason to keep hitting the DB every 30 seconds for the counts; 5 minutes seems more than enough.

My understanding is that if I set min_interval here to 300s (5m), it will just keep the value for that long and keep responding with it, however often Prometheus asks.
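
Assuming this is sql_exporter's per-collector min_interval (which caches results and re-runs the query at most once per interval), the setting would sit in the collector file from the earlier sketch:

```yaml
collector_name: sir_table_rows  # hypothetical, as above
min_interval: 5m                # run the query at most once per 5 minutes;
                                # scrapes in between are served the cached value
```
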
As I understand it, this will make the container come up when Grafana does.
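
Assuming Compose's depends_on is the mechanism at work here (service names are illustrative), that would look like:

```yaml
services:
  grafana:
    image: grafana/grafana-oss
    depends_on:
      # bringing up grafana now also brings up this container
      - sql-exporter   # illustrative name for the exporter service
```
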
mwiencek changed the base branch from master to monitoring on January 29, 2025 at 06:06.

mwiencek (Member, Author)

The only issue I have with this is that the node-exporter log is being spammed with the following:

```
node-exporter  | ts=2025-01-29T06:07:03.163Z caller=stdlib.go:105 level=error msg="error gathering metrics: 21 error(s) occurred:
* [from Gatherer #2] collected metric "node_filesystem_device_error" { label:{name:"device" value:"devpts"} label:{name:"device_error" value:""} label:{name:"fstype" value:"devpts"} label:{name:"mountpoint" value:"/dev/pts"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_readonly" { label:{name:"device" value:"devpts"} label:{name:"device_error" value:""} label:{name:"fstype" value:"devpts"} label:{name:"mountpoint" value:"/dev/pts"} gauge:{value:1}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_size_bytes" { label:{name:"device" value:"devpts"} label:{name:"device_error" value:""} label:{name:"fstype" value:"devpts"} label:{name:"mountpoint" value:"/dev/pts"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_free_bytes" { label:{name:"device" value:"devpts"} label:{name:"device_error" value:""} label:{name:"fstype" value:"devpts"} label:{name:"mountpoint" value:"/dev/pts"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_avail_bytes" { label:{name:"device" value:"devpts"} label:{name:"device_error" value:""} label:{name:"fstype" value:"devpts"} label:{name:"mountpoint" value:"/dev/pts"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_files" { label:{name:"device" value:"devpts"} label:{name:"device_error" value:""} label:{name:"fstype" value:"devpts"} label:{name:"mountpoint" value:"/dev/pts"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_files_free" { label:{name:"device" value:"devpts"} label:{name:"device_error" value:""} label:{name:"fstype" value:"devpts"} label:{name:"mountpoint" value:"/dev/pts"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_device_error" { label:{name:"device" value:"mqueue"} label:{name:"device_error" value:""} label:{name:"fstype" value:"mqueue"} label:{name:"mountpoint" value:"/dev/mqueue"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_readonly" { label:{name:"device" value:"mqueue"} label:{name:"device_error" value:""} label:{name:"fstype" value:"mqueue"} label:{name:"mountpoint" value:"/dev/mqueue"} gauge:{value:1}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_size_bytes" { label:{name:"device" value:"mqueue"} label:{name:"device_error" value:""} label:{name:"fstype" value:"mqueue"} label:{name:"mountpoint" value:"/dev/mqueue"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_free_bytes" { label:{name:"device" value:"mqueue"} label:{name:"device_error" value:""} label:{name:"fstype" value:"mqueue"} label:{name:"mountpoint" value:"/dev/mqueue"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_avail_bytes" { label:{name:"device" value:"mqueue"} label:{name:"device_error" value:""} label:{name:"fstype" value:"mqueue"} label:{name:"mountpoint" value:"/dev/mqueue"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_files" { label:{name:"device" value:"mqueue"} label:{name:"device_error" value:""} label:{name:"fstype" value:"mqueue"} label:{name:"mountpoint" value:"/dev/mqueue"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_files_free" { label:{name:"device" value:"mqueue"} label:{name:"device_error" value:""} label:{name:"fstype" value:"mqueue"} label:{name:"mountpoint" value:"/dev/mqueue"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_device_error" { label:{name:"device" value:"proc"} label:{name:"device_error" value:""} label:{name:"fstype" value:"proc"} label:{name:"mountpoint" value:"/proc"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_readonly" { label:{name:"device" value:"proc"} label:{name:"device_error" value:""} label:{name:"fstype" value:"proc"} label:{name:"mountpoint" value:"/proc"} gauge:{value:1}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_size_bytes" { label:{name:"device" value:"proc"} label:{name:"device_error" value:""} label:{name:"fstype" value:"proc"} label:{name:"mountpoint" value:"/proc"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_free_bytes" { label:{name:"device" value:"proc"} label:{name:"device_error" value:""} label:{name:"fstype" value:"proc"} label:{name:"mountpoint" value:"/proc"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_avail_bytes" { label:{name:"device" value:"proc"} label:{name:"device_error" value:""} label:{name:"fstype" value:"proc"} label:{name:"mountpoint" value:"/proc"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_files" { label:{name:"device" value:"proc"} label:{name:"device_error" value:""} label:{name:"fstype" value:"proc"} label:{name:"mountpoint" value:"/proc"} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #2] collected metric "node_filesystem_files_free" { label:{name:"device" value:"proc"} label:{name:"device_error" value:""} label:{name:"fstype" value:"proc"} label:{name:"mountpoint" value:"/proc"} gauge:{value:0}} was collected before with the same name and label values"
```

I have no idea why this is happening or how to resolve it; devpts, for example, is already listed in --collector.filesystem.ignored-fs-types, so I don't know why it's still being collected.
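
Not a verified diagnosis, but two hedged guesses worth checking: newer node_exporter releases renamed --collector.filesystem.ignored-fs-types to --collector.filesystem.fs-types-exclude, and "was collected before with the same name and label values" usually means the same mountpoint appears more than once in the mount table the collector reads. A sketch of the corresponding flags (the exclusion patterns are illustrative):

```yaml
    command:
      - '--path.rootfs=/host'
      # renamed flag on newer node_exporter versions (formerly
      # --collector.filesystem.ignored-fs-types):
      - '--collector.filesystem.fs-types-exclude=^(devpts|mqueue|proc|sysfs)$'
      # also exclude by mountpoint, in case the same mount is visible twice:
      - '--collector.filesystem.mount-points-exclude=^/(dev|proc|sys)($|/)'
```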
