Provision a basic PSQL dashboard for grafana #290

reosarevok · 2024-12-13T16:57:55Z

This is based on https://grafana.com/grafana/dashboards/14114-postgres-overview/ with an extra check for max query duration that seemed interesting, and is mostly intended as a proof of concept for provisioning dashboards. We can further improve the dashboard as needed.

mwiencek · 2025-01-15T19:38:26Z

This actually works quite nicely for me, though we could build on top of the metrics you added by creating a basic dashboard for them. That could also be used as an example for how to add additional dashboards to the repo. Here's one that includes a panel for the artist count:

{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "__elements": {},
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "11.4.0"
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    },
    {
      "type": "panel",
      "id": "timeseries",
      "name": "Time series",
      "version": ""
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "description": "Number of rows in the artist table.",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "barWidthFactor": 0.6,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "displayName": "rows",
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "11.4.0",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "artist_count",
          "fullMetaSearch": false,
          "includeNullMetadata": true,
          "legendFormat": "__auto",
          "range": true,
          "refId": "A",
          "useBackend": false
        }
      ],
      "title": "Artist Count",
      "type": "timeseries"
    }
  ],
  "schemaVersion": 40,
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "browser",
  "title": "Table Row Counts",
  "uid": "cea3e99siyr5sa",
  "version": 2,
  "weekStart": ""
}

For comparing with Solr metrics like number of documents per Solr core, we probably only need a metric for each entity table corresponding to a Solr core?
Currently these row count metrics are collected every 30s, which seems too frequent.
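
As a rough sketch of how the single-panel JSON above could be extended to more tables, the following Python generates one time series panel per row-count metric. Only `artist_count` comes from this thread; the other table names, the `<table>_count` naming pattern, and the helper names are assumptions for illustration. The panels are deliberately minimal and omit the `fieldConfig` and `options` blocks that Grafana fills in on export.

```python
import json


def row_count_panel(panel_id, metric, title, x, y, w=12, h=8):
    """Return a minimal time series panel dict querying one row-count metric."""
    datasource = {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "datasource": datasource,
        "gridPos": {"h": h, "w": w, "x": x, "y": y},
        "targets": [{"datasource": datasource, "expr": metric, "refId": "A"}],
    }


def row_count_panels(tables):
    """Lay out one panel per table, two per row on Grafana's 24-column grid."""
    panels = []
    for i, table in enumerate(tables):
        panels.append(
            row_count_panel(
                panel_id=i + 1,
                metric=f"{table}_count",  # assumed metric naming pattern
                title=f"{table.replace('_', ' ').title()} Count",
                x=(i % 2) * 12,
                y=(i // 2) * 8,
            )
        )
    return panels


# Only "artist" appears in this thread; "release" and "recording" are hypothetical.
panels = row_count_panels(["artist", "release", "recording"])
print(json.dumps(panels[0], indent=2))
```

The resulting list could replace the "panels" array in the dashboard JSON; whether the real exporter uses these metric names would need checking against its collector config.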

reosarevok · 2025-01-16T13:26:28Z

That seems like a good start. I expanded and provisioned the dashboard, adding all the counts that seem to be relevant as per https://github.com/metabrainz/sir/blob/e9e63641cd103c29a1aca456fb870d9f7d508774/sir/schema/__init__.py - and changed sql-exporter to only check stuff every 5 minutes.

mwiencek · 2025-01-29T03:57:23Z

default/sql-exporter.yml

@@ -4,7 +4,7 @@ global:
  # timing out first.
  scrape_timeout_offset: 500ms
  # Minimum interval between collector runs: by default (0s) collectors are executed on every scrape.
-  min_interval: 0s
+  min_interval: 300s


We should probably align this with Prometheus's scrape_interval too (and Grafana's $__interval when fetching the data points), since there's no point in producing duplicate data points for the same scraped value.

Edit: Pushed a commit to try and do that.
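
For illustration, a small Python check that the three intervals agree might look like the sketch below. The 300s value for min_interval is from the diff above; the Prometheus and Grafana values are assumed here, and the helper names are made up. It only handles single-unit durations such as "300s" or "5m", not compound ones like "1h30m".

```python
import re

# Milliseconds per unit, kept as integers to avoid float rounding.
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}


def to_millis(duration):
    """Convert a simple duration such as '300s', '5m' or '500ms' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_MS[unit]


def aligned(*durations):
    """True if all durations describe the same length of time."""
    return len({to_millis(d) for d in durations}) == 1


settings = {
    "sql-exporter min_interval": "300s",  # from the diff above
    "prometheus scrape_interval": "5m",   # assumed value
    "grafana panel interval": "5m",       # assumed value
}
print(aligned(*settings.values()))
```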

mwiencek · 2025-01-29T04:04:36Z

Now that I'm looking at the row counts dashboard again, I do lean towards defaulting to the time series view: it seems you can only view that by editing the panel or clicking "explore," which isn't very obvious, though I may be missing something. The main reason though is that the graph is more useful, because it shows you if the data is actually changing (which can be useful for determining if replication is working, for example).

reosarevok · 2025-01-29T10:33:00Z

Ok, I still think that's a niche case but I'm not going to be the one using this often, so sure, why not :) Changed that with an extra commit.

yvanzo

Thanks for having created these dashboards, and for the graph view.

The interval (5 min) might be too long for analyzing what is going on with replication. What do you think @mwiencek?

Also made a comment about service dependencies.

Anyway, feel free to make these checks and changes after merging if it makes it easier to move on with other pull requests.

yvanzo · 2025-01-29T17:06:00Z

compose/monitoring.yml

+    depends_on:
+      prometheus:
+        condition: service_started
+      db:
+        condition: service_started


That makes sql-exporter start after these services, which is great.
Is sql-exporter robust enough to not stop if the database musicbrainz is not available yet?
Otherwise a custom condition or health check might be needed.

reosarevok · 2025-02-04

I tested this (as per @mwiencek's suggestion) by taking the services down, renaming the database, taking them back up, and then after a while renaming the database back to musicbrainz_db. Both errored while unable to connect, but eventually connected back and started showing data again, so they seem resilient enough.

mwiencek · 2025-01-29T17:33:10Z

> The interval (5 min) might be too long for analyzing what is going on with replication. What do you think @mwiencek?

I thought it seemed reasonable for hourly replication, as 5m already produces 12 data points per hour, most of them duplicates, if replication is operating normally via cron.

I could see it being too long in a couple of cases though:

- The mirror is behind and is applying several replication packets per 5m interval
- You're running LoadReplicationChanges by hand and have to wait 5m for the stats to update
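
To put rough numbers on the trade-off, here is a small Python sketch. It assumes the row counts change exactly once per hour (hourly replication) and that samples are evenly spaced; the function name is made up for this example.

```python
def samples_per_change(change_period_s, sample_interval_s):
    """Return (samples, redundant) recorded between two consecutive value changes.

    Assumes evenly spaced samples and a value that changes once per period,
    so every sample after the first in a period repeats the same value.
    """
    samples = change_period_s // sample_interval_s
    redundant = max(samples - 1, 0)
    return samples, redundant


for interval in (30, 300):
    samples, redundant = samples_per_change(3600, interval)
    print(f"{interval}s interval: {samples} samples/hour, {redundant} redundant")
# 30s interval: 120 samples/hour, 119 redundant
# 300s interval: 12 samples/hour, 11 redundant
```

In the two cases listed above the assumption of one change per hour no longer holds, which is where a shorter interval would start to pay off.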

yvanzo · 2025-01-29T17:47:01Z

> for hourly replication

The main goal was to allow for debugging and profiling in a development setup, where replication packets can be replayed much faster.

Allowing for monitoring actual mirrors would be great too, but possibly different dashboards then?

As a start, monitor the number of rows in sir-indexed tables. Includes a dashboard with gauges for every table; I don't see why line charts would be useful here, since there's no reason to expect huge jumps. It's just good to have a clear idea, from the numbers, of which tables are bigger.

There seems to be no good reason why we would keep hitting the DB every 30 seconds to get the counts. 5 minutes seems more than enough. My understanding is that if I set min_interval here to 300s (5m) it will just keep the value for that long and keep responding with it, however often prometheus asks.
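
As a toy model of the behaviour described above (not sql-exporter's actual implementation), the Python sketch below runs the query at most once per min_interval and serves the cached value in between, however often it is asked. The class name, the fake clock, and the stand-in query are all assumptions for this example.

```python
import time


class CachedCollector:
    """Run `query` at most once per `min_interval` seconds; otherwise reuse the last value."""

    def __init__(self, query, min_interval, clock=time.monotonic):
        self._query = query
        self._min_interval = min_interval
        self._clock = clock
        self._last_run = None
        self._value = None

    def collect(self):
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self._min_interval:
            self._value = self._query()  # the only place the database is hit
            self._last_run = now
        return self._value


calls = []


def count_rows():
    """Hypothetical stand-in for a SELECT count(*) query."""
    calls.append(1)
    return len(calls)


fake_now = [0.0]
collector = CachedCollector(count_rows, min_interval=300, clock=lambda: fake_now[0])

# Simulate scrapes at these timestamps (seconds).
for t in (0, 30, 60, 299, 300):
    fake_now[0] = t
    collector.collect()

print(len(calls))  # 2: one query at t=0 and one at t=300
```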

yvanzo · 2025-01-31T16:28:10Z

I merged SIR dev stuff into master, rebased the target branch monitoring on it, and rebased the source branch provision-psql-dashboard on it to resolve conflicts.

mwiencek · 2025-01-31T17:13:22Z

Will merge this as discussed yesterday so I can rebase my other PRs. If further changes are needed they can be made in the monitoring branch.

reosarevok force-pushed the provision-psql-dashboard branch from c40a2c7 to cbb90c9 Compare January 16, 2025 13:23

mwiencek mentioned this pull request Jan 22, 2025

Solr metrics #291

Merged

mwiencek reviewed Jan 29, 2025

mwiencek force-pushed the provision-psql-dashboard branch from 53512fa to 7c47e94 Compare January 29, 2025 04:10

yvanzo reviewed Jan 29, 2025

yvanzo force-pushed the monitoring branch from 644fd4b to 0a398ce Compare January 31, 2025 16:20

reosarevok and others added 6 commits January 31, 2025 17:22

Add sql-exporter dependency for grafana

4a4ffe8

This will make the container come up when grafana does, I understand.

Align Prometheus, SQL Exporter, and Grafana intervals to 5m

7f2e81e

Use time series for table row counts

97de202

The rest of the team feels time series can be useful for row counts, so this changes the dashboard to use time series graphs instead of gauges.

yvanzo force-pushed the provision-psql-dashboard branch from 34efe32 to 97de202 Compare January 31, 2025 16:22

mwiencek merged commit c1464d1 into monitoring Jan 31, 2025

mwiencek deleted the provision-psql-dashboard branch January 31, 2025 17:13
