
Refactor metrics to be scoped to the homeserver #18592

Background

As part of Element's plan to support a light form of vhosting (virtual hosting: multiple homeserver tenants in a single shard), we're diving into the details and implications of running multiple instances of Synapse in the same Python process. Currently, metrics are combined across all Synapse instances because the Synapse codebase uses the default global REGISTRY provided by the Prometheus client.

"Per-tenant metrics" tracked internally by https://github.com/element-hq/synapse-small-hosts/issues/5

Potential solutions

(as discussed below)

Homeserver-specific CollectorRegistry (specify registry)

Prometheus has a concept of metric CollectorRegistry instances, and all of the metrics in the client library already support specifying a registry. We could refactor things to point at a homeserver-specific registry instead of the default global REGISTRY. This is also the mechanism that naturally maps to the problem of metrics from different hosts (one registry per host), since that's what would happen if the servers were running separately.
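
As a rough sketch (not Synapse's actual code) of what the registry approach could look like with the Python prometheus_client library:

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Hypothetical wiring: one registry per homeserver tenant instead of the
# client library's default global REGISTRY.
hs_registry = CollectorRegistry()

requests = Counter(
    "synapse_http_requests",
    "Number of HTTP requests received",
    registry=hs_registry,  # scoped to this homeserver's registry
)
requests.inc()

# Each tenant's registry could then be exposed and scraped separately.
print(generate_latest(hs_registry).decode())
```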

But this still results in instance labels being attached to each metric when scraped: "When Prometheus scrapes a target, it attaches some labels automatically to the scraped time series which serve to identify the scraped target: [job, instance]" (source). That means we end up with instance labels carrying the same amount of cardinality as if we had just added the labels ourselves.

Add server name instance label to the metrics

Adding the instance label (server name) to the metric itself has the benefit that we can distinguish between metrics that apply at the Python process level and metrics that apply at the per-homeserver-tenant level. We don't need to label the per-process metrics that can only really be measured at the Python process level (process_*, python_*), like CPU usage, Python garbage collection, and Twisted reactor tick time. That means we don't waste space storing duplicate metrics, and we avoid a misleading instance label that suggests a metric belongs to one homeserver tenant when it really applies to all tenants in the shard process.

Note that instance has a special meaning in the Prometheus docs: it should be "The <host>:<port> part of the target's URL that was scraped." (source) (also: "In Prometheus terms, an endpoint you can scrape is called an instance, usually corresponding to a single process."). That's different from how we would be using it here, so we're using our own custom server_name label instead.

The docs mention it's fine to provide an instance label in the scraped data as long as honor_labels: true is configured in the Prometheus scrape config (no longer relevant since we're using our own server_name label).
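
As a rough sketch of the label approach (the metric name here is just an example, not necessarily a real Synapse metric), per-tenant metrics stay on the default global REGISTRY and carry a server_name label:

```python
from prometheus_client import Counter

events_persisted = Counter(
    "synapse_storage_events_persisted",
    "Number of events persisted",
    labelnames=["server_name"],
)

# Each homeserver tenant increments its own labelled child.
events_persisted.labels(server_name="hs1.example.com").inc()
events_persisted.labels(server_name="hs2.example.com").inc(3)
```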

We also have a simple scrape story: the best setup seems to be a single scrape for the whole shard (consisting of multiple homeserver tenants) while continuing to use the global REGISTRY in Synapse. All of the per-tenant data will carry the appropriate server_name label, and per-process data won't be duplicated because it's a single scrape.
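
For illustration only (using the plain prometheus_client API rather than Synapse's own metrics listener), a single scrape endpoint for the whole shard could look like:

```python
from prometheus_client import start_http_server

# One /metrics endpoint for the whole process: all tenants' server_name-labelled
# series plus the process_*/python_* collectors are served from the global
# REGISTRY and exported exactly once per scrape.
start_http_server(8000)
```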

Either method involves a similar amount of bulk refactoring, since we need to plumb through either the registry or an extra label everywhere.

Plan forward

Add server_name labels to all homeserver-scoped metrics.

We will not be changing the per-process metrics that can only really be measured at the Python process level (process_*, python_*), like CPU usage, Python garbage collection, and Twisted reactor tick time.

This issue serves as a central place to track work remaining as we refactor the codebase.
