-
Notifications
You must be signed in to change notification settings - Fork 529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add new gauge metric for alertmanager config size #9267
Conversation
550ff6d
to
ed8da98
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! A few comments:
pkg/alertmanager/multitenant.go
Outdated
configSize: promauto.With(registerer).NewGaugeVec(prometheus.GaugeOpts{ | ||
Name: "cortex_alertmanager_config_size_bytes", | ||
Help: "Size of the last received alertmanager configuration.", | ||
}, []string{"user"}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is a bit ambiguous, it's not clear that we're only using this for Grafana configurations. Maybe something like cortex_alertmanager_grafana_config_size_bytes
is enough to avoid confusions.
The multitenantAlertmanagerMetrics
struct might be a better place for defining this metric. We could add a new field (e.g. grafanaConfigSize
) and initialize the gauge in newMultitenantAlertmanagerMetrics()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is a bit ambiguous, it's not clear that we're only using this for Grafana configurations. Maybe something like cortex_alertmanager_grafana_config_size_bytes is enough to avoid confusions.
Yep, good point - changed now 👍
The multitenantAlertmanagerMetrics struct might be a better place for defining this metric. We could add a new field (e.g. grafanaConfigSize) and initialize the gauge in newMultitenantAlertmanagerMetrics()
Agreed. Also fixed now 👍
pkg/alertmanager/api_grafana.go
Outdated
@@ -316,6 +316,7 @@ func (am *MultitenantAlertmanager) SetUserGrafanaConfig(w http.ResponseWriter, r | |||
util.WriteJSONResponse(w, errorResult{Status: statusError, Error: fmt.Sprintf("%s: %s", errReadingGrafanaConfig, err.Error())}) | |||
return | |||
} | |||
am.multitenantMetrics.grafanaConfigSize.WithLabelValues(userID).Set(float64(len(payload))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few problems with this approach:
- If the AM restarts, the Grafana configuration might be there, but we won't get metrics for it until the next POST request from Grafana to update the config
- The metric doesn't change when the configuration is deleted
What if we updated this metric during the config sync for each tenant? E.g. in computeConfig()
, by setting it to the len(cfgs.Grafana.RawConfig)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, moved that to computeConfig
now 👍
pkg/alertmanager/multitenant.go
Outdated
@@ -721,10 +728,12 @@ func (am *MultitenantAlertmanager) computeConfig(cfgs alertspb.AlertConfigDescs) | |||
// Grafana configuration. | |||
case cfgs.Mimir.RawConfig == am.fallbackConfig: | |||
level.Debug(am.logger).Log("msg", "mimir configuration is default, using grafana config with the default globals", "user", cfgs.Mimir.User) | |||
am.multitenantMetrics.grafanaConfigSize.WithLabelValues(cfgs.Grafana.User).Set(float64(len(cfgs.Grafana.RawConfig))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do this once at the top of the method, before entering the switch
. Even if the Grafana config is ignored, we'll get the size in bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I wasn't sure if we actually wanted to record it in those cases.
I've moved it out of computeConfigs
and into syncConfigs
instead, mainly because of the user reference from the cfgMap
but also considering there are other metrics being recorded there. Please let me know if you agree 🙌
Co-authored-by: Santiago <[email protected]>
https://github.com/grafana/alerting-squad/issues/907