Provision a basic PSQL dashboard for grafana #290

Merged
merged 6 commits into from
Jan 31, 2025

Conversation

reosarevok
Member

This is based on https://grafana.com/grafana/dashboards/14114-postgres-overview/ with an extra check for max query duration that seemed interesting, and is mostly intended as a proof of concept for provisioning dashboards. We can further improve the dashboard as needed.

@mwiencek
Member

This actually works quite nicely for me, though we could build on top of the metrics you added by creating a basic dashboard for them. That could also be used as an example for how to add additional dashboards to the repo. Here's one that includes a panel for the artist count:

{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "__elements": {},
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "11.4.0"
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    },
    {
      "type": "panel",
      "id": "timeseries",
      "name": "Time series",
      "version": ""
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "description": "Number of rows in the artist table.",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "barWidthFactor": 0.6,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "displayName": "rows",
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "11.4.0",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "artist_count",
          "fullMetaSearch": false,
          "includeNullMetadata": true,
          "legendFormat": "__auto",
          "range": true,
          "refId": "A",
          "useBackend": false
        }
      ],
      "title": "Artist Count",
      "type": "timeseries"
    }
  ],
  "schemaVersion": 40,
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "browser",
  "title": "Table Row Counts",
  "uid": "cea3e99siyr5sa",
  "version": 2,
  "weekStart": ""
}
  • For comparing with Solr metrics like number of documents per Solr core, we probably only need a metric for each entity table corresponding to a Solr core?
  • Currently these row count metrics are collected every 30s, which seems too frequent.
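As a rough illustration of how the single-panel dashboard above could be extended to more tables, here is a minimal Python sketch that generates one time series panel per row-count metric and lays them out two per row. It assumes every metric follows the `<table>_count` naming of `artist_count`; the `release` and `recording` entries and the helper names are hypothetical, and the panel is stripped down to the fields needed for the example rather than being a full export.

```python
import json

GRID_COLUMNS = 24  # Grafana lays panels out on a 24-column grid
PANEL_WIDTH = 12   # same width as the "Artist Count" panel above
PANEL_HEIGHT = 8   # same height as the "Artist Count" panel above


def row_count_panel(panel_id, metric, title, x, y):
    """Build a minimal time series panel for one row-count metric."""
    datasource = {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}
    return {
        "datasource": dict(datasource),
        "fieldConfig": {"defaults": {"displayName": "rows"}, "overrides": []},
        "gridPos": {"h": PANEL_HEIGHT, "w": PANEL_WIDTH, "x": x, "y": y},
        "id": panel_id,
        "targets": [{"datasource": dict(datasource), "expr": metric, "refId": "A"}],
        "title": title,
        "type": "timeseries",
    }


def build_panels(tables):
    """Place one panel per table, filling each grid row left to right."""
    per_row = GRID_COLUMNS // PANEL_WIDTH
    panels = []
    for index, table in enumerate(tables):
        panels.append(
            row_count_panel(
                panel_id=index + 1,
                # Assumption: metrics are named "<table>_count", like artist_count.
                metric=f"{table}_count",
                title=f"{table.replace('_', ' ').title()} Count",
                x=(index % per_row) * PANEL_WIDTH,
                y=(index // per_row) * PANEL_HEIGHT,
            )
        )
    return panels


# "artist" comes from the dashboard above; the other names are placeholders.
panels = build_panels(["artist", "release", "recording"])
print(json.dumps([panel["gridPos"] for panel in panels], indent=2))
```

The resulting list could replace the `panels` array of the dashboard JSON; the remaining panel options would fall back to Grafana's defaults.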

@reosarevok reosarevok force-pushed the provision-psql-dashboard branch from c40a2c7 to cbb90c9 Compare January 16, 2025 13:23
@reosarevok
Member Author

That seems like a good start. I expanded and provisioned the dashboard, adding all the counts that seem to be relevant as per https://github.com/metabrainz/sir/blob/e9e63641cd103c29a1aca456fb870d9f7d508774/sir/schema/__init__.py - and changed sql-exporter to only check stuff every 5 minutes.

@mwiencek mwiencek mentioned this pull request Jan 22, 2025
@@ -4,7 +4,7 @@ global:
   # timing out first.
   scrape_timeout_offset: 500ms
   # Minimum interval between collector runs: by default (0s) collectors are executed on every scrape.
-  min_interval: 0s
+  min_interval: 300s
Member

@mwiencek mwiencek Jan 29, 2025

We should probably align this with Prometheus's scrape_interval too (and Grafana's $__interval when fetching the data points), since there's no point in producing duplicate data points for the same scraped value.

Edit: Pushed a commit to try and do that.
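To make the duplicate-data-point concern concrete, here is a toy Python model of a collector that reruns its query at most once per `min_interval` and otherwise returns the cached value. It only illustrates the caching behavior as described in this thread; it is not sql-exporter's implementation, and the class and function names are hypothetical.

```python
class CachedCollector:
    """Toy collector that reruns its query at most once per min_interval."""

    def __init__(self, query, min_interval):
        self.query = query
        self.min_interval = min_interval
        self.last_run = None
        self.cached = None
        self.runs = 0

    def scrape(self, now):
        """Return a fresh value if the cache has expired, else the cached one."""
        if self.last_run is None or now - self.last_run >= self.min_interval:
            self.cached = self.query(now)
            self.last_run = now
            self.runs += 1
        return self.cached


def simulate(scrape_interval, min_interval, duration):
    """Scrape for `duration` seconds; return (samples stored, queries run)."""
    # The query just returns the current time as a stand-in for a row count.
    collector = CachedCollector(query=lambda now: now, min_interval=min_interval)
    samples = [collector.scrape(t) for t in range(0, duration, scrape_interval)]
    return len(samples), collector.runs


# Scraping every 30s against a 300s cache stores many repeated samples.
print(simulate(scrape_interval=30, min_interval=300, duration=3600))
# Aligning the scrape interval with min_interval stores one sample per query.
print(simulate(scrape_interval=300, min_interval=300, duration=3600))
```

In this model, a scrape interval shorter than `min_interval` only multiplies stored samples without adding information, which is the motivation for aligning the two settings.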

@mwiencek
Member

Now that I'm looking at the row counts dashboard again, I do lean towards defaulting to the time series view: it seems you can only view that by editing the panel or clicking "explore," which isn't very obvious, though I may be missing something. The main reason though is that the graph is more useful, because it shows you if the data is actually changing (which can be useful for determining if replication is working, for example).

@mwiencek mwiencek force-pushed the provision-psql-dashboard branch from 53512fa to 7c47e94 Compare January 29, 2025 04:10
@reosarevok
Member Author

Ok, I still think that's a niche case but I'm not going to be the one using this often, so sure, why not :) Changed that with an extra commit.

Contributor

@yvanzo yvanzo left a comment

Thanks for having created these dashboards, and for the graph view.

The interval (5 min) might be too long for analyzing what is going on with replication. What do you think @mwiencek?

Also made a comment about service dependencies.

Anyway, feel free to make these checks and changes after merging if it makes it easier to move on with other pull requests.

Comment on lines +56 to +60
depends_on:
  prometheus:
    condition: service_started
  db:
    condition: service_started
Contributor

That makes sql-exporter start after these services, which is great.
Is sql-exporter robust enough to not stop if the database musicbrainz is not available yet?
Otherwise a custom condition or health check might be needed.

Member Author

I tested this (as per @mwiencek's suggestion) by taking the services down, renaming the database, taking them back up, and then after a while renaming the database back to musicbrainz_db. Both errored while unable to connect, but eventually connected back and started showing data again, so they seem resilient enough.
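The resilience observed in that test amounts to retrying the connection until the database exists again. Here is a minimal Python sketch of that pattern, using a stand-in database object rather than a real PostgreSQL client. It is not how sql-exporter is implemented; `wait_for_database` and `FlakyDatabase` are hypothetical names used only for illustration.

```python
import time


def wait_for_database(connect, retry_delay=1.0, max_attempts=10, sleep=time.sleep):
    """Call connect() until it succeeds; return (connection, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect(), attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(retry_delay)


class FlakyDatabase:
    """Stand-in for a database that becomes available after some failures."""

    def __init__(self, fail_times):
        self.remaining_failures = fail_times

    def connect(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError('database "musicbrainz_db" is not available')
        return "connected"


# Simulate a database that is missing for the first three attempts.
db = FlakyDatabase(fail_times=3)
result, attempts = wait_for_database(db.connect, sleep=lambda seconds: None)
print(result, attempts)
```

A service that behaves this way can start with `condition: service_started` alone; a service that exits on the first failed connection would need the health check mentioned above.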

@mwiencek
Member

> The interval (5 min) might be too long for analyzing what is going on with replication. What do you think @mwiencek?

I thought it seemed reasonable for hourly replication, as 5m already yields 12 data points per hour, 11 of them duplicates, if replication is operating normally via cron.

I could see it being too long in a couple cases though:

  • The mirror is behind and is applying several replication packets per 5m interval
  • You're running LoadReplicationChanges by hand and have to wait 5m for the stats to update
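The trade-off can be put in numbers with a short Python sketch. The 300s collection interval and hourly replication come from this discussion; the 60s per-packet apply time for a catching-up mirror is an assumed figure for illustration only, and the function names are hypothetical.

```python
def samples_per_cycle(collect_interval_s, replication_period_s):
    """Return (samples, redundant samples) recorded in one replication cycle."""
    samples = replication_period_s // collect_interval_s
    return samples, max(samples - 1, 0)


def packets_per_sample(collect_interval_s, packet_apply_s):
    """Return how many applied packets are collapsed into a single sample."""
    return max(collect_interval_s // packet_apply_s, 1)


# Hourly replication sampled every 5 minutes: 12 samples, 11 of them redundant.
print(samples_per_cycle(collect_interval_s=300, replication_period_s=3600))

# A mirror catching up at an assumed 60s per packet: 5 packets per sample.
print(packets_per_sample(collect_interval_s=300, packet_apply_s=60))
```

In the second case the graph shows one jump per five packets, and a manual LoadReplicationChanges run can take up to the full collection interval to show up.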

@yvanzo
Contributor

yvanzo commented Jan 29, 2025

> for hourly replication

The main goal was to allow for debugging and profiling in development setup, where replication packets can be replayed much faster.

Allowing for monitoring actual mirrors would be great too, but possibly different dashboards then?

reosarevok and others added 6 commits January 31, 2025 17:22
This is based on https://grafana.com/grafana/dashboards/14114-postgres-overview/
with an extra check for max query duration that seemed interesting,
and is mostly intended as a proof of concept for provisioning dashboards.
We can further improve the dashboard as needed.

As a start, monitor the amount of rows on sir-indexed tables.

Includes a dashboard with gauges for every table; I don't see
a reason why it would be useful to have these be line charts since
there's no reason we should expect huge jumps, it's just good
to have a clear idea of which tables are bigger with the numbers.

There seems to be no good reason why we would keep hitting the DB
every 30 seconds to get the counts. 5 minutes seems more than enough.

My understanding is that if I set min_interval here to 300s (5m)
it will just keep the value for that long and keep responding with it,
however often prometheus asks.

This will make the container come up when grafana does, I understand.

The rest of the team feels time series can be useful for row counts,
so this changes the dashboard to use time series graphs instead of gauges.
@yvanzo yvanzo force-pushed the provision-psql-dashboard branch from 34efe32 to 97de202 Compare January 31, 2025 16:22
@yvanzo
Contributor

yvanzo commented Jan 31, 2025

I merged SIR dev stuff into master, rebased the target branch monitoring on it, and rebased the source branch provision-psql-dashboard on it to resolve conflicts.

@mwiencek
Member

Will merge this as discussed yesterday so I can rebase my other PRs. If further changes are needed they can be made in the monitoring branch.

@mwiencek mwiencek merged commit c1464d1 into monitoring Jan 31, 2025
@mwiencek mwiencek deleted the provision-psql-dashboard branch January 31, 2025 17:13
3 participants