Background
As part of Element's plan to support a light form of vhosting (virtual hosting, i.e. multiple homeserver tenants in a single shard), we're diving into the details and implications of running multiple instances of Synapse in the same Python process. Currently, metrics are combined across all Synapse instances because the Synapse codebase uses the default global REGISTRY provided by the Prometheus client.
"Per-tenant metrics" tracked internally by https://github.com/element-hq/synapse-small-hosts/issues/5
Potential solutions
(as discussed below)
Homeserver-specific CollectorRegistry (specify registry)
Prometheus has a concept of metric CollectorRegistry instances, and all of the metrics from the client library already support specifying the registry. We could refactor things to point to our homeserver-specific registry instead of the default global REGISTRY. This is also the mechanism that naturally maps to the problem of metrics from different hosts (registries), as that's what would happen if the servers were actually separate.
But this still results in instance labels getting added to each metric when scraped: "When Prometheus scrapes a target, it attaches some labels automatically to the scraped time series which serve to identify the scraped target: [job, instance]" (source). Since each per-homeserver registry would be exposed as its own scrape target, we would end up with instance labels with the same amount of cardinality as if we had just added the labels ourselves.
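For illustration, here's a minimal sketch of what the registry approach could look like with prometheus_client (the metric name is made up for the example and isn't one of Synapse's real metrics):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# One registry per homeserver tenant, instead of the default global REGISTRY.
hs_registry = CollectorRegistry()

# Hypothetical metric, registered against the tenant's registry.
example_requests = Counter(
    "example_http_requests",
    "Number of HTTP requests received by this tenant",
    registry=hs_registry,
)
example_requests.inc()

# Each tenant registry then needs to be exposed (or merged) for scraping,
# which is where the per-target instance labels come from.
print(generate_latest(hs_registry).decode("utf-8"))
```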
Add server name instance label to the metrics
Adding the instance label (server name) to the metric itself has the benefit that we can distinguish between metrics that apply at the Python process level vs the per-homeserver-tenant level. We don't need to label the per-process metrics that can only really be measured at the Python process level (process_*, python_*) like CPU usage, Python garbage collection, and Twisted reactor tick time. This means we don't waste space storing duplicate metrics, and we avoid putting a misleading instance label on a metric that looks scoped to one homeserver tenant but really applies to all tenants in the shard process.
Note that instance has special meaning according to the Prometheus docs: it should be "The <host>:<port> part of the target's URL that was scraped." (source) (also: "In Prometheus terms, an endpoint you can scrape is called an instance, usually corresponding to a single process."). That is different from how we'd be using it here, so we use our own custom server_name label instead.
The docs mention it's fine to provide an instance label in the scraped data and we just need to have honor_labels: true configured in the Prometheus scrape config (no longer relevant since we're using our own server_name label).
We also have a simple scrape story: The best setup seems to be to use a single scrape for the whole shard (consisting of multiple homeserver tenants) and continue to use the global REGISTRY in Synapse. All of the per-tenant data will have the appropriate instance label as necessary and per-process data won't be duplicated because it's a single scrape.
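For comparison, here's a minimal sketch of the label approach against the default global REGISTRY with a single scrape for the whole shard (the metric names and server names are made up; SERVER_NAME_LABEL is assumed to be a shared constant):

```python
from prometheus_client import Counter, Gauge

SERVER_NAME_LABEL = "server_name"

# Homeserver-scoped metric: carries a server_name label per tenant.
example_requests = Counter(
    "example_http_requests",
    "Number of HTTP requests received, per homeserver tenant",
    labelnames=[SERVER_NAME_LABEL],
)
example_requests.labels(**{SERVER_NAME_LABEL: "hs1.example.com"}).inc()
example_requests.labels(**{SERVER_NAME_LABEL: "hs2.example.com"}).inc()

# Per-process metric: only measurable at the Python process level
# (CPU usage, GC, reactor tick time), so it stays unlabelled and is
# reported once for the whole shard.
example_reactor_tick = Gauge(
    "example_reactor_tick_seconds", "Example per-process measurement"
)
example_reactor_tick.set(0.01)
```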
Either method involves a similar amount of bulk refactoring, since we need to specify either the registry or an extra label everywhere.
Plan forward
Add server_name labels to all homeserver-scoped metrics.
We will not be changing the per-process metrics that can only really be measured at the Python process level (process_*, python_*) like CPU usage, Python garbage collection, and Twisted reactor tick time.
This issue serves as a central place to track work remaining as we refactor the codebase.
- Refactor built-in Prometheus metrics to use homeserver-scoped registry
  - Counter -> Refactor Counter metrics to be homeserver-scoped #18656, Refactor background process metrics to be homeserver-scoped #18670
  - Histogram -> Refactor Histogram metrics to be homeserver-scoped #18724
  - Gauge -> Refactor Gauge metrics to be homeserver-scoped #18725
  - Summary -> Update metrics linting to be able to handle custom metrics #18733
  - Info -> Update metrics linting to be able to handle custom metrics #18733
  - Enum -> Update metrics linting to be able to handle custom metrics #18733
- Refactor our custom Prometheus Collector metrics to use homeserver-scoped registry
  - LaterGauge -> Refactor LaterGauge metrics to be homeserver-scoped #18714
  - InFlightGauge -> Refactor Measure block metrics to be homeserver-scoped (v2) #18601, Update metrics linting to be able to handle custom metrics #18733
  - GaugeBucketCollector -> Refactor GaugeBucketCollector metrics to be homeserver-scoped #18715
  - CPUMetrics (this is a per-process metric) -> Update metrics linting to be able to handle custom metrics #18733
  - GCCounts (this is a per-process metric) -> Update metrics linting to be able to handle custom metrics #18733
  - PyPyGCStats (this is a per-process metric) -> Update metrics linting to be able to handle custom metrics #18733
  - ReactorLastSeenMetric (this is a per-process metric) -> Update metrics linting to be able to handle custom metrics #18733
  - Background process _Collector -> Refactor background process metrics to be homeserver-scoped #18670
  - JemallocCollector (this is a per-process metric) -> Update metrics linting to be able to handle custom metrics #18733
  - DynamicCollectorRegistry -> Update metrics linting to be able to handle custom metrics #18733, Cleanly shutdown SynapseHomeServer object #18828
- Refactor wrappers around metrics:
  - Cache metrics: LruCache / @cached, CacheMetric -> Refactor cache metrics to be homeserver-scoped #18604, Cleanly shutdown SynapseHomeServer object #18828
  - Measure -> Refactor Measure block metrics to be homeserver-scoped #18591, Refactor Measure block metrics to be homeserver-scoped (v2) #18601
- Conflicting metrics that already use the server_name label:
  - Make sure we lint custom Metric to ensure the SERVER_NAME_LABEL is included (see the sketch after this list) -> Update metrics linting to be able to handle custom metrics #18733
    - UnknownMetricFamily
    - CounterMetricFamily
    - GaugeMetricFamily
    - SummaryMetricFamily
    - InfoMetricFamily
    - HistogramMetricFamily
    - GaugeHistogramMetricFamily
    - StateSetMetricFamily
    - Our own GaugeHistogramMetricFamilyWithLabels (introduced in Refactor GaugeBucketCollector metrics to be homeserver-scoped #18715)
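As a rough illustration of what including the SERVER_NAME_LABEL looks like for a custom collector, here's a sketch using prometheus_client's GaugeMetricFamily; the collector class and metric name are hypothetical, not Synapse's actual code:

```python
from prometheus_client import REGISTRY
from prometheus_client.core import GaugeMetricFamily

SERVER_NAME_LABEL = "server_name"


class ExampleTenantCollector:
    """Hypothetical custom collector that emits one sample per homeserver tenant."""

    def __init__(self, server_names):
        self._server_names = server_names

    def collect(self):
        # The lint check would verify that SERVER_NAME_LABEL is listed here.
        family = GaugeMetricFamily(
            "example_tenant_widgets",
            "Number of widgets, per homeserver tenant",
            labels=[SERVER_NAME_LABEL],
        )
        for server_name in self._server_names:
            family.add_metric([server_name], 42.0)
        yield family


# Any object with a collect() method can be registered as a collector.
REGISTRY.register(ExampleTenantCollector(["hs1.example.com", "hs2.example.com"]))
```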
- Add linting to prevent usage of REGISTRY (no longer relevant because we moved to adding instance labels to the metrics)
- Add linting to prevent new metrics being used that use the default REGISTRY (no longer relevant because we moved to adding instance labels to the metrics)
- Add linting to ensure metrics have a SERVER_NAME_LABEL label added. Per-process metrics should just use a lint ignore comment with the reasoning why. -> Refactor Counter metrics to be homeserver-scoped #18656
- Add linting to ensure metric reporting includes the SERVER_NAME_LABEL value. Per-process metrics should just use a lint ignore comment with the reasoning why. (No need: if you pass in the wrong number of labels, then prometheus_client will raise a ValueError at runtime. Ideally, we'd be able to catch this at type-checking time but this is good enough.)
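For context on that last point (why no extra lint is needed for report sites), here's a small sketch of the runtime behaviour being relied on; the metric name is made up, but the ValueError comes from prometheus_client itself when the label count doesn't match:

```python
from prometheus_client import Counter

SERVER_NAME_LABEL = "server_name"

example_requests = Counter(
    "example_http_requests",
    "Number of HTTP requests received, per homeserver tenant",
    labelnames=[SERVER_NAME_LABEL],
)

# Correct usage: every report site supplies the server_name label value.
example_requests.labels(**{SERVER_NAME_LABEL: "hs1.example.com"}).inc()

# Forgetting the label (or passing the wrong number of label values) fails
# loudly at runtime instead of silently recording mislabelled data.
try:
    example_requests.labels().inc()
except ValueError as exc:
    print(f"prometheus_client rejected the call: {exc}")
```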