Background
As part of Element's plan to support a light form of vhosting (virtual hosting, i.e. multiple homeserver tenants in a single shard), we're diving into the details and implications of running multiple instances of Synapse in the same Python process. Currently, metrics are combined across all Synapse instances because the Synapse codebase uses the default global REGISTRY provided by the Prometheus client.
"Per-tenant metrics" tracked internally by https://github.com/element-hq/synapse-small-hosts/issues/5
Potential solutions
(as discussed below)
Homeserver-specific CollectorRegistry (specify registry)
Prometheus has a concept of metric CollectorRegistry instances, and all of the metrics from the client library already support specifying the registry. We could refactor things to point to our homeserver-specific registry instead of the default global REGISTRY. This is also the mechanism that naturally maps to the problem of metrics from different hosts (registries), as that's what would happen if the servers were actually separate.
But this still results in instance labels getting added to each metric when scraped: "When Prometheus scrapes a target, it attaches some labels automatically to the scraped time series which serve to identify the scraped target: [job, instance]" (source). Since each per-homeserver registry would be exposed as its own scrape target, we would end up with instance labels with the same amount of cardinality as if we had just added the labels ourselves.
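For illustration, here's a minimal sketch of what the registry approach could look like with prometheus_client (the metric name is made up for the example and isn't one of Synapse's real metrics):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# One registry per homeserver tenant, instead of the default global REGISTRY.
hs_registry = CollectorRegistry()

# Hypothetical metric, registered against the tenant's registry.
example_requests = Counter(
    "example_http_requests",
    "Number of HTTP requests received by this tenant",
    registry=hs_registry,
)
example_requests.inc()

# Each tenant registry then needs to be exposed (or merged) for scraping,
# which is where the per-target instance labels come from.
print(generate_latest(hs_registry).decode("utf-8"))
```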
Add server name instance label to the metrics
Adding the instance label (server name) to the metric itself has the benefit that we can distinguish between metrics that apply at the Python process level vs the per-homeserver-tenant level. We don't need to label the per-process metrics that can only really be measured at the Python process level (process_*, python_*) like CPU usage, Python garbage collection, and Twisted reactor tick time. This means we don't waste space storing duplicate metrics, and we avoid putting a misleading instance label on a metric that looks scoped to one homeserver tenant but really applies to all tenants in the shard process.
Note that instance has special meaning according to the Prometheus docs: it should be "The <host>:<port> part of the target's URL that was scraped." (source) (also: "In Prometheus terms, an endpoint you can scrape is called an instance, usually corresponding to a single process."). That is different from how we'd be using it here, so we use our own custom server_name label instead.
The docs mention it's fine to provide an instance label in the scraped data and we just need to have honor_labels: true configured in the Prometheus scrape config (no longer relevant since we're using our own server_name label).
We also have a simple scrape story: The best setup seems to be to use a single scrape for the whole shard (consisting of multiple homeserver tenants) and continue to use the global REGISTRY in Synapse. All of the per-tenant data will have the appropriate instance label as necessary and per-process data won't be duplicated because it's a single scrape.
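For comparison, here's a minimal sketch of the label approach against the default global REGISTRY with a single scrape for the whole shard (the metric names and server names are made up; SERVER_NAME_LABEL is assumed to be a shared constant):

```python
from prometheus_client import Counter, Gauge

SERVER_NAME_LABEL = "server_name"

# Homeserver-scoped metric: carries a server_name label per tenant.
example_requests = Counter(
    "example_http_requests",
    "Number of HTTP requests received, per homeserver tenant",
    labelnames=[SERVER_NAME_LABEL],
)
example_requests.labels(**{SERVER_NAME_LABEL: "hs1.example.com"}).inc()
example_requests.labels(**{SERVER_NAME_LABEL: "hs2.example.com"}).inc()

# Per-process metric: only measurable at the Python process level
# (CPU usage, GC, reactor tick time), so it stays unlabelled and is
# reported once for the whole shard.
example_reactor_tick = Gauge(
    "example_reactor_tick_seconds", "Example per-process measurement"
)
example_reactor_tick.set(0.01)
```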
Either method involves a similar amount of bulk refactoring, since we need to specify either the registry or an extra label everywhere.
Plan forward
Add server_name labels to all homeserver-scoped metrics.
We will not be changing the per-process metrics that can only really be measured at the Python process level (process_*, python_*) like CPU usage, Python garbage collection, and Twisted reactor tick time.
This issue serves as a central place to track work remaining as we refactor the codebase.
- Refactor built-in Prometheus metrics to use homeserver-scoped registry
  - Counter -> Refactor Counter metrics to be homeserver-scoped #18656, Refactor background process metrics to be homeserver-scoped #18670
  - Histogram -> Refactor Histogram metrics to be homeserver-scoped #18724
  - Gauge -> Refactor Gauge metrics to be homeserver-scoped #18725
  - Summary -> Update metrics linting to be able to handle custom metrics #18733
  - Info -> Update metrics linting to be able to handle custom metrics #18733
  - Enum -> Update metrics linting to be able to handle custom metrics #18733
- Refactor our custom Prometheus Collector metrics to use homeserver-scoped registry
  - LaterGauge -> Refactor LaterGauge metrics to be homeserver-scoped #18714
  - InFlightGauge -> Refactor Measure block metrics to be homeserver-scoped (v2) #18601, Update metrics linting to be able to handle custom metrics #18733
  - GaugeBucketCollector -> Refactor GaugeBucketCollector metrics to be homeserver-scoped #18715
  - CPUMetrics (this is a per-process metric) -> Update metrics linting to be able to handle custom metrics #18733
  - GCCounts (this is a per-process metric) -> Update metrics linting to be able to handle custom metrics #18733
  - PyPyGCStats (this is a per-process metric) -> Update metrics linting to be able to handle custom metrics #18733
  - ReactorLastSeenMetric (this is a per-process metric) -> Update metrics linting to be able to handle custom metrics #18733
  - Background process _Collector -> Refactor background process metrics to be homeserver-scoped #18670
  - JemallocCollector (this is a per-process metric) -> Update metrics linting to be able to handle custom metrics #18733
  - DynamicCollectorRegistry -> Update metrics linting to be able to handle custom metrics #18733, Cleanly shutdown SynapseHomeServer object #18828
- Refactor wrappers around metrics:
  - Cache metrics: LruCache / @cached, CacheMetric -> Refactor cache metrics to be homeserver-scoped #18604, Cleanly shutdown SynapseHomeServer object #18828
  - Measure -> Refactor Measure block metrics to be homeserver-scoped #18591, Refactor Measure block metrics to be homeserver-scoped (v2) #18601
- Conflicting metrics that already use the server_name label:
  - Make sure we lint custom Metric to ensure the SERVER_NAME_LABEL is included (see the sketch after this list) -> Update metrics linting to be able to handle custom metrics #18733
    - UnknownMetricFamily
    - CounterMetricFamily
    - GaugeMetricFamily
    - SummaryMetricFamily
    - InfoMetricFamily
    - HistogramMetricFamily
    - GaugeHistogramMetricFamily
    - StateSetMetricFamily
    - Our own GaugeHistogramMetricFamilyWithLabels (introduced in Refactor GaugeBucketCollector metrics to be homeserver-scoped #18715)
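As a rough illustration of what including the SERVER_NAME_LABEL looks like for a custom collector, here's a sketch using prometheus_client's GaugeMetricFamily; the collector class and metric name are hypothetical, not Synapse's actual code:

```python
from prometheus_client import REGISTRY
from prometheus_client.core import GaugeMetricFamily

SERVER_NAME_LABEL = "server_name"


class ExampleTenantCollector:
    """Hypothetical custom collector that emits one sample per homeserver tenant."""

    def __init__(self, server_names):
        self._server_names = server_names

    def collect(self):
        # The lint check would verify that SERVER_NAME_LABEL is listed here.
        family = GaugeMetricFamily(
            "example_tenant_widgets",
            "Number of widgets, per homeserver tenant",
            labels=[SERVER_NAME_LABEL],
        )
        for server_name in self._server_names:
            family.add_metric([server_name], 42.0)
        yield family


# Any object with a collect() method can be registered as a collector.
REGISTRY.register(ExampleTenantCollector(["hs1.example.com", "hs2.example.com"]))
```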
- Add linting to prevent usage of REGISTRY (no longer relevant because we moved to adding instance labels to the metrics)
- Add linting to prevent new metrics being used that use the default REGISTRY (no longer relevant because we moved to adding instance labels to the metrics)
- Add linting to ensure metrics have a SERVER_NAME_LABEL label added. Per-process metrics should just use a lint ignore comment with the reasoning why. -> Refactor Counter metrics to be homeserver-scoped #18656
- Add linting to ensure metric reporting includes the SERVER_NAME_LABEL value. Per-process metrics should just use a lint ignore comment with the reasoning why. (No need: if you pass in the wrong number of labels, then prometheus_client will raise a ValueError at runtime. Ideally, we'd be able to catch this at type-checking time but this is good enough.)
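For context on that last point (why no extra lint is needed for report sites), here's a small sketch of the runtime behaviour being relied on; the metric name is made up, but the ValueError comes from prometheus_client itself when the label count doesn't match:

```python
from prometheus_client import Counter

SERVER_NAME_LABEL = "server_name"

example_requests = Counter(
    "example_http_requests",
    "Number of HTTP requests received, per homeserver tenant",
    labelnames=[SERVER_NAME_LABEL],
)

# Correct usage: every report site supplies the server_name label value.
example_requests.labels(**{SERVER_NAME_LABEL: "hs1.example.com"}).inc()

# Forgetting the label (or passing the wrong number of label values) fails
# loudly at runtime instead of silently recording mislabelled data.
try:
    example_requests.labels().inc()
except ValueError as exc:
    print(f"prometheus_client rejected the call: {exc}")
```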