Commit 292de67

A94: OTel metrics for Subchannels (#485)
* A94: gRPC OTel metrics for Subchannels
* Add discussion thread
* Reviewer comments
* Add updated by tag to A78
* Add note on stability
* Formatting
* Add windows error code for connection aborted
* Add security level label
* Reviewer comments
* Reviewer comments
* Reviewer comment
* Fix github id
* Fix link
* Reviewer comment
* Reviewer comments
* Reviewer comment
* Move status to ready for implementation
1 parent 343b4d8 commit 292de67

2 files changed

+154
-3
lines changed

A78-grpc-metrics-wrr-pf-xds.md

Lines changed: 3 additions & 3 deletions
@@ -4,9 +4,9 @@ A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient
44
* Approver: @ejona86, @dfawley
55
* Status: {Draft, In Review, Ready for Implementation, Implemented}
66
* Implemented in: <language, ...>
7-
* Last updated: 2024-09-24
7+
* Last updated: 2025-07-01
88
* Discussion at: https://groups.google.com/g/grpc-io/c/A2Mqz8OMDys
9-
* Updated by: [A88: xDS Data Error Handling](A88-xds-data-error-handling.md)
9+
* Updated by: [A88: xDS Data Error Handling](A88-xds-data-error-handling.md), [A94: OTel metrics for Subchannels](A94-subchannel-otel-metrics.md)
1010

1111
## Abstract
1212

@@ -103,7 +103,7 @@ The following metrics will be exported:
103103
| grpc.lb.wrr.endpoint_weight_stale | Counter | {endpoint} | grpc.target, grpc.lb.locality | Number of endpoints from each scheduler update whose latest weight is older than the expiration period. |
104104
| grpc.lb.wrr.endpoint_weights | Histogram | {weight} | grpc.target, grpc.lb.locality | Weight of each endpoint, recorded on every scheduler update. Endpoints without usable weights will be recorded as weight 0. |
105105

106-
### Pick First LB Policy
106+
### [Outdated] Pick First LB Policy (Updated by [A94](A94-subchannel-otel-metrics.md))
107107

108108
The Pick First LB policy predates the gRFC process but was updated in
109109
[A62]. We propose to add the following metrics to it.

A94-subchannel-otel-metrics.md

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
1+
## A94: OTel metrics for Subchannels
2+
3+
* Author(s): Yash Tibrewal (@yashykt)
4+
* Approver: Mark Roth (@markdroth), Eric Anderson (@ejona86), Doug Fawley
5+
(@dfawley)
6+
* Status: Ready for Implementation
7+
* Implemented in:
8+
* Last updated: 2025-08-12
9+
* Discussion at: https://groups.google.com/g/grpc-io/c/iMdK7r4E5tU
10+
11+
## Abstract
12+
13+
Introduce OpenTelemetry metrics for subchannels. These metrics will replace the
14+
existing pick-first metrics.
15+
16+
## Background
17+
18+
In [A78], metrics for the PickFirst load-balancing policy were proposed that provide
19+
observability on disconnections for subchannels and connection attempts made for
20+
those subchannels. These metrics do not currently contain information on the
21+
reason for disconnection, the xDS locality, or the cluster information.
22+
23+
[A89] is a proposal to introduce a new optional label `grpc.lb.backend_service`
24+
to client-side per-attempt metrics. This label has xDS cluster information.
25+
26+
### Related Proposals:
27+
28+
* [A8]: Client-side Keepalive
29+
* [A18]: TCP User Timeout
30+
* [A61]: IPv4 and IPv6 Dualstack Backend Support
31+
* [A66]: OpenTelemetry Metrics
32+
* [A74]: xDS Config Tears
33+
* [A78]: gRPC OTel Metrics for WRR, Pick First, and XdsClient
34+
* [A79]: Non-per-call Metrics Architecture
35+
* [A89]: Backend Service Metric Label
36+
* [L62]: gRPC security level negotiation between call credentials and channels
37+
38+
[A8]: A8-client-side-keepalive.md
39+
[A18]: A18-tcp-user-timeout.md
40+
[A61]: A61-IPv4-IPv6-dualstack-backends.md
41+
[A66]: A66-otel-stats.md
42+
[A74]: A74-xds-config-tears.md
43+
[A78]: A78-grpc-metrics-wrr-pf-xds.md
44+
[A79]: A79-non-per-call-metrics-architecture.md
45+
[A89]: A89-backend-service-metric-label.md
46+
[L62]: L62-core-call-credential-security-level.md
47+
48+
## Proposal
49+
50+
Move the existing pick-first metrics to subchannel metrics
51+
(`grpc.lb.pick_first.*` to `grpc.subchannel.*`) with the addition of optional
52+
labels as shown below -
53+
54+
Metric Name | Type | Unit | Labels | Description
55+
------------------------------------------------------------------------------------------------------ | -------------- | --------------- | -------------------------------------------------------------------------------------------------------------- | -----------
56+
grpc.subchannel.disconnections (Old - grpc.lb.pick_first.disconnections) | Counter | {disconnection} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional), grpc.disconnect_error (optional) | Number of times the selected subchannel becomes disconnected.
57+
grpc.subchannel.connection_attempts_succeeded (Old - grpc.lb.pick_first.connection_attempts_succeeded) | Counter | {attempt} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of successful connection attempts.
58+
grpc.subchannel.connection_attempts_failed (Old - grpc.lb.pick_first.connection_attempts_failed) | Counter | {attempt} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of failed connection attempts.
59+
grpc.subchannel.open_connections | UpDown Counter | {connection} | grpc.target, grpc.security_level (optional), grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of open connections.
60+
61+
If we end up discarding connection attempts as we do with the “happy eyeballs”
62+
algorithm (as per [A61]), we should not record the connection attempt or the
63+
disconnection.
64+
65+
Implementations that have already implemented the pick-first metrics should give
66+
enough time for users to transition to the new metrics. For example,
67+
implementations should report both the old pick-first metrics and the new
68+
subchannel metrics for 2 releases, and then remove the old pick-first metrics.
69+
70+
Label Name | Disposition | Description
71+
----------------------- | ----------- | -----------
72+
grpc.target | Required | Indicates the target of the gRPC channel (defined in [A66].)
73+
grpc.lb.backend_service | Optional | The backend service to which the RPC was routed (defined in [A89].)
74+
grpc.lb.locality | Optional | The locality to which the traffic is being sent. This will be set to the resolver attribute passed down from the weighted_target policy, or the empty string if the resolver attribute is unset (defined in [A78].)
75+
grpc.disconnect_error | Optional | Reason for disconnection.
76+
grpc.security_level | Optional | Denotes the security level of the connection. Allowed values - "none", "integrity_only" and "privacy_and_integrity".
77+
78+
The subchannel needs to be passed attributes with the values for the
79+
`grpc.lb.backend_service` and `grpc.lb.locality` labels (defined in [A89] and
80+
[A78] respectively). This implies that the subchannel will be recreated when
81+
these attributes change. Since only xDS currently uses these labels, the
82+
attributes will be set for each endpoint or address by cds (post-[A74]) or
83+
xds_cluster_resolver (pre-[A74]) LB policies.
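
One way to picture why a change in these attributes leads to a new subchannel: if the label values are part of the identity used to look up a subchannel, a different value produces a different key. The following Python sketch assumes a hypothetical `SubchannelPool` keyed by address plus the two label attributes. The class names, addresses, and locality strings are made up for illustration, and real implementations differ in how subchannels are keyed and pooled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubchannelKey:
    """Hypothetical subchannel identity: address plus label attributes."""
    address: str
    backend_service: str = ""  # value for grpc.lb.backend_service
    locality: str = ""         # value for grpc.lb.locality; empty if unset


class Subchannel:
    def __init__(self, key: SubchannelKey) -> None:
        self.key = key

    def metric_labels(self, target: str) -> dict:
        """Labels attached to this subchannel's grpc.subchannel.* metrics."""
        return {
            "grpc.target": target,
            "grpc.lb.backend_service": self.key.backend_service,
            "grpc.lb.locality": self.key.locality,
        }


class SubchannelPool:
    def __init__(self) -> None:
        self._subchannels = {}

    def get_or_create(self, key: SubchannelKey) -> Subchannel:
        # A key that differs in any attribute misses the lookup, so a new
        # subchannel is created instead of reusing the existing one.
        if key not in self._subchannels:
            self._subchannels[key] = Subchannel(key)
        return self._subchannels[key]


pool = SubchannelPool()
a = pool.get_or_create(SubchannelKey("10.0.0.1:443", "cluster_a", "zone-1"))
b = pool.get_or_create(SubchannelKey("10.0.0.1:443", "cluster_a", "zone-1"))
c = pool.get_or_create(SubchannelKey("10.0.0.1:443", "cluster_a", "zone-2"))
print(a is b, a is c)
```

In this model the same address with an unchanged locality reuses the subchannel, while the same address with a new locality gets a fresh one whose metrics carry the new label value.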
84+
85+
List of allowed values for `grpc.disconnect_error` -
86+
87+
Error string | Description
88+
-------------------- | -----------
89+
GOAWAY <ERROR_CODE> | HTTP/2 GOAWAY frame with an error code, for example “GOAWAY NO_ERROR”, “GOAWAY PROTOCOL_ERROR”, or “GOAWAY ENHANCE_YOUR_CALM”. The list of error codes is available in [RFC 9113](https://www.rfc-editor.org/rfc/rfc9113.html#name-error-codes).
90+
subchannel shutdown | The subchannel was shut down. This can happen due to reasons such as the parent channel shutting down, the channel becoming idle, the load balancing policy changing due to a resolver update, or a change in the list of endpoint addresses.
91+
connection reset | Connection was reset (e.g. ECONNRESET, WSAECONNRESET).
92+
connection timed out | Connection timed out (e.g. ETIMEDOUT, WSAETIMEDOUT). This also includes connections closed due to gRPC keepalives ([A8]).
93+
connection aborted | Connection was aborted (e.g. ECONNABORTED, WSAECONNABORTED).
94+
socket error | Any socket error not covered by “connection reset”, “connection timed out” and “connection aborted”. Implementations that are not able to differentiate between the different socket error codes should also use this.
95+
unknown | Catch-all for all other reasons.
96+
97+
For a given connection, there can be multiple reasons reported to the subchannel
98+
for disconnection. For example, a connection could have seen a GOAWAY frame with
99+
`ENHANCE_YOUR_CALM` and then a socket error Broken Pipe. In such cases, the
100+
first seen reason should be chosen, `GOAWAY ENHANCE_YOUR_CALM` in this case.
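
The selection rule above can be sketched in Python as follows. This is a minimal illustration, not taken from any gRPC implementation: `classify_socket_error` and `DisconnectReason` are hypothetical names, and only POSIX-style `errno` values are handled (the Windows `WSAE*` codes named in the table are omitted).

```python
import errno

# Label values from the grpc.disconnect_error table above.
_ERRNO_TO_LABEL = {
    errno.ECONNRESET: "connection reset",
    errno.ETIMEDOUT: "connection timed out",
    errno.ECONNABORTED: "connection aborted",
}


def classify_socket_error(err: int) -> str:
    """Map an errno value to a grpc.disconnect_error label value."""
    # Any socket error without a dedicated label falls back to "socket error".
    return _ERRNO_TO_LABEL.get(err, "socket error")


class DisconnectReason:
    """Hypothetical helper that keeps the first reason seen for a connection."""

    def __init__(self) -> None:
        self._reason = None

    def report(self, reason: str) -> None:
        # Later reports are ignored: the first seen reason wins.
        if self._reason is None:
            self._reason = reason

    def label(self) -> str:
        # "unknown" is the catch-all when nothing was reported.
        return self._reason if self._reason is not None else "unknown"


conn = DisconnectReason()
conn.report("GOAWAY ENHANCE_YOUR_CALM")
conn.report(classify_socket_error(errno.EPIPE))  # broken pipe arrives later
print(conn.label())
```

Here the later broken-pipe error would classify as "socket error", but the label stays `GOAWAY ENHANCE_YOUR_CALM` because it was reported first.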
101+
102+
We might add more error cases to this in the future.
103+
104+
### Stability
105+
106+
As recommended by [A79], these metrics will start off as experimental, and hence
107+
off-by-default. The decision on whether these metrics will be on-by-default or
108+
off-by-default on de-experimentalization will be made at the same time as the
109+
de-experimentalization.
110+
111+
## Rationale
112+
113+
### Renaming pick-first metrics
114+
115+
The existing pick-first metrics provide stats on subchannel disconnections and
116+
connection attempts as viewed from the perspective of the pick-first lb policy.
117+
[A61] made pick-first lb policy the universal leaf policy. For users unfamiliar
118+
with this, it will come as a surprise when metrics for pick-first lb policy are
119+
populated when round_robin lb policy is configured (for example). Additionally,
120+
the pick-first metrics are defined from the perspective of the channel. This
121+
means that if subchannels are shared between multiple channels (as is the case
122+
for gRPC Core and its wrapped languages - C++, Python), we will double-count the
123+
disconnections/connection attempts.
124+
125+
Renaming/moving the pick-first metrics to subchannel makes this more intuitive,
126+
and fixes the double-counting problem.
127+
128+
### Metric for open connections
129+
130+
Moving the metrics down to subchannel potentially allows us to calculate the
131+
number of open connections by subtracting `grpc.subchannel.disconnections` from
132+
`grpc.subchannel.connection_attempts_succeeded`. This method does not work for
133+
exporters recording counters per period in a way that does not allow for a
134+
simple subtraction of the two counters
135+
(https://github.com/grpc/grpc/issues/34886).
136+
137+
Adding an explicit metric that records the number of open connections avoids
138+
this.
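
To make the problem concrete, consider a backend that only receives per-period (delta) values and never sees one of the periods. The sketch below is a toy model in plain Python. The `SubchannelStats` class and the dropped period are assumptions chosen for illustration, not a description of gRPC, OpenTelemetry, or the linked issue.

```python
class SubchannelStats:
    """Toy model of the subchannel counters for a single label set."""

    def __init__(self) -> None:
        self.connection_attempts_succeeded = 0  # monotonic counter
        self.disconnections = 0                 # monotonic counter
        self.open_connections = 0               # up-down counter

    def on_connected(self) -> None:
        self.connection_attempts_succeeded += 1
        self.open_connections += 1

    def on_disconnected(self) -> None:
        self.disconnections += 1
        self.open_connections -= 1

    def snapshot(self) -> tuple:
        return (self.connection_attempts_succeeded, self.disconnections)


stats = SubchannelStats()
periods = [
    ["connect", "connect", "connect"],  # period 1
    ["disconnect"],                     # period 2
    ["connect"],                        # period 3
]

deltas = []  # per-period (succeeded, disconnections) increments
previous = stats.snapshot()
for events in periods:
    for event in events:
        if event == "connect":
            stats.on_connected()
        else:
            stats.on_disconnected()
    current = stats.snapshot()
    deltas.append((current[0] - previous[0], current[1] - previous[1]))
    previous = current

# Assumption: the backend never received period 1.
received = deltas[1:]
derived = sum(s for s, _ in received) - sum(d for _, d in received)
print(derived)                 # value reconstructed by subtraction
print(stats.open_connections)  # value tracked explicitly at the source
```

In this model the subtraction yields 0 while three connections are actually open, whereas the explicitly tracked value does not depend on reconstructing history from two separate series.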
139+
140+
### Combining connection timeouts and keepalives into a single disconnection error
141+
142+
We expect most implementations of [A8] to also set the POSIX socket option
143+
`TCP_USER_TIMEOUT` with the same timeout value as stated in [A18]. As such, in
144+
cases where the connection is broken, the keepalive timeout will race with
145+
sockets being closed due to `TCP_USER_TIMEOUT`. Since the motive of the two
146+
timers is essentially the same, we choose to combine them into a single error,
147+
instead of trying to differentiate between them.
148+
149+
## Implementation
150+
151+
TBD
