Provision a basic PSQL dashboard for grafana #290
Conversation
This actually works quite nicely for me, though we could build on top of the metrics you added by creating a basic dashboard for them. That could also be used as an example for how to add additional dashboards to the repo. Here's one that includes a panel for the artist count:
{
"__inputs": [
{
"name": "DS_PROMETHEUS",
"label": "Prometheus",
"description": "",
"type": "datasource",
"pluginId": "prometheus",
"pluginName": "Prometheus"
}
],
"__elements": {},
"__requires": [
{
"type": "grafana",
"id": "grafana",
"name": "Grafana",
"version": "11.4.0"
},
{
"type": "datasource",
"id": "prometheus",
"name": "Prometheus",
"version": "1.0.0"
},
{
"type": "panel",
"id": "timeseries",
"name": "Time series",
"version": ""
}
],
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"description": "Number of rows in the artist table.",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"displayName": "rows",
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"pluginVersion": "11.4.0",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"disableTextWrap": false,
"editorMode": "builder",
"expr": "artist_count",
"fullMetaSearch": false,
"includeNullMetadata": true,
"legendFormat": "__auto",
"range": true,
"refId": "A",
"useBackend": false
}
],
"title": "Artist Count",
"type": "timeseries"
}
],
"schemaVersion": 40,
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {},
"timezone": "browser",
"title": "Table Row Counts",
"uid": "cea3e99siyr5sa",
"version": 2,
"weekStart": ""
}
Force-pushed from c40a2c7 to cbb90c9.
That seems like a good start. I expanded and provisioned the dashboard, adding all the counts that seem relevant per https://github.com/metabrainz/sir/blob/e9e63641cd103c29a1aca456fb870d9f7d508774/sir/schema/__init__.py - and changed sql-exporter to only run its checks every 5 minutes.
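For reference, provisioning a dashboard like the JSON above usually only takes a dashboard provider file next to it. A minimal sketch (paths and provider name are assumptions, not taken from this repo):

```yaml
# grafana/provisioning/dashboards/default.yml (path assumed)
apiVersion: 1
providers:
  - name: 'default'          # provider name assumed
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      # Grafana loads every dashboard JSON it finds under this path
      path: /etc/grafana/provisioning/dashboards
```

Any dashboard JSON dropped into that path is then picked up automatically, which is what makes this a handy template for adding further dashboards.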
default/sql-exporter.yml (outdated)

```diff
@@ -4,7 +4,7 @@ global:
   # timing out first.
   scrape_timeout_offset: 500ms
   # Minimum interval between collector runs: by default (0s) collectors are executed on every scrape.
-  min_interval: 0s
+  min_interval: 300s
```
We should probably align this with Prometheus's scrape_interval too (and Grafana's $__interval when fetching the data points), since there's no point in producing duplicate data points for the same scraped value.
Edit: Pushed a commit to try and do that.
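Aligning the two intervals might look like this on the Prometheus side (job name and target are assumptions; 9399 is sql-exporter's default listen port):

```yaml
# prometheus.yml (sketch, not this repo's actual config)
global:
  scrape_interval: 5m          # match sql-exporter's min_interval: 300s
scrape_configs:
  - job_name: sql-exporter     # job name assumed
    static_configs:
      - targets: ['sql-exporter:9399']
```

With both set to 5 minutes, each collector run yields exactly one scraped sample instead of repeated copies of a cached value.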
Now that I'm looking at the row counts dashboard again, I lean towards defaulting to the time series view: it seems you can only reach it by editing the panel or clicking "Explore", which isn't very obvious (though I may be missing something). The main reason, though, is that the graph is more useful: it shows whether the data is actually changing, which can help determine if replication is working, for example.
Force-pushed from 53512fa to 7c47e94.
OK, I still think that's a niche case, but I'm not going to be the one using this often, so sure, why not :) Changed that with an extra commit.
Thanks for having created these dashboards, and for the graph view.
The interval (5 min) might be too long for analyzing what is going on with replication. What do you think @mwiencek?
Also made a comment about service dependencies.
Anyway, feel free to make these checks and changes after merging if it makes it easier to move on with other pull requests.
```yaml
depends_on:
  prometheus:
    condition: service_started
  db:
    condition: service_started
```
That makes sql-exporter start after these services, which is great.
Is sql-exporter robust enough to not stop if the database musicbrainz is not available yet?
Otherwise a custom condition or health check might be needed.
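If it turns out not to be, one option is gating the start order on a pg_isready-based health check; a sketch, with the user and database names being assumptions:

```yaml
# docker-compose sketch (service names follow the snippet above,
# credentials are assumed)
db:
  healthcheck:
    # pg_isready exits 0 once PostgreSQL is accepting connections
    test: ["CMD-SHELL", "pg_isready -U musicbrainz -d musicbrainz_db"]
    interval: 10s
    timeout: 5s
    retries: 5

sql-exporter:
  depends_on:
    db:
      # wait for the health check instead of mere container start
      condition: service_healthy
```

`service_healthy` makes Compose hold sql-exporter back until the check passes, rather than starting it as soon as the db container exists.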
I tested this (as per @mwiencek's suggestion) by taking the services down, renaming the database, taking them back up, and then after a while renaming the database back to musicbrainz_db. Both errored while unable to connect, but eventually reconnected and started showing data again, so they seem resilient enough.
I thought it seemed reasonable for hourly replication, as 5m already creates 12 duplicate data points per hour when replication is operating normally via cron. I could see it being too long in a couple of cases, though:
The main goal was to allow for debugging and profiling in the development setup, where replication packets can be replayed much faster. Allowing for monitoring actual mirrors would be great too, but that possibly calls for different dashboards.
This is based on https://grafana.com/grafana/dashboards/14114-postgres-overview/ with an extra check for max query duration that seemed interesting, and is mostly intended as a proof of concept for provisioning dashboards. We can further improve the dashboard as needed.
As a start, monitor the number of rows in sir-indexed tables. Includes a dashboard with gauges for every table; I don't see why line charts would be useful here, since there's no reason to expect huge jumps. It's just good to have a clear idea of which tables are bigger, with the numbers.
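For context, a row-count metric like the dashboard's artist_count can come from a sql-exporter collector along these lines (collector name and layout are assumptions, not this PR's actual config):

```yaml
# Sketch of a sql-exporter collector file (names assumed)
collector_name: table_row_counts
metrics:
  - metric_name: artist_count
    type: gauge
    help: 'Number of rows in the artist table.'
    values: [count]
    query: |
      SELECT count(*) AS count FROM artist
```

Adding another table is then just another `metric_name`/`query` pair, which keeps the per-table gauges cheap to extend.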
There seems to be no good reason to keep hitting the DB every 30 seconds to get the counts; 5 minutes seems more than enough. My understanding is that setting min_interval to 300s (5m) makes the exporter cache the value for that long and keep responding with it, however often Prometheus asks.
This will make the container come up when grafana does, as I understand it.
The rest of the team feels time series can be useful for row counts, so this changes the dashboard to use time series graphs instead of gauges.
Force-pushed from 34efe32 to 97de202.
I merged SIR dev stuff into
Will merge this as discussed yesterday so I can rebase my other PRs. If further changes are needed they can be made in the