Implement otel retry metrics #12064

Open

AgraVator wants to merge 6 commits into master from implement-otel-retry-metric

Conversation

Contributor

@AgraVator commented May 14, 2025

Implements gRFC A96 (OpenTelemetry retry metrics).

@@ -64,8 +65,8 @@ public Stopwatch get() {
};

Contributor

Add unit tests.

Contributor Author

The gRFC has had recent updates; I'm waiting for it to be merged before making further changes.

Member

Yeah, there were discussions yesterday about the recent changes to the retry metrics and whether they helped. The implementation should be mostly the same, but it probably isn't worth chasing the gRFC at the moment.

@AgraVator force-pushed the implement-otel-retry-metric branch from a7a16ce to 3529769 on May 27, 2025 07:02
@AgraVator force-pushed the implement-otel-retry-metric branch from 3529769 to bd69ed5 on June 24, 2025 11:38
Optional<MetricData> metric = openTelemetryTesting.getMetrics().stream()
.filter(m -> m.getName().equals(metricName))
.findFirst();
if (metric.isPresent()) {
Contributor

When would this metric not be present for the operations the test performs on the tracer?

Contributor Author

Whenever they are not enabled or have default values

Contributor

As this is test code, it is risky to have the assertion inside an if statement. If the absence of the metric is not expected, then it should not cause the test to pass.
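A minimal sketch of the shape being suggested (the assertion style and helper names are illustrative, not the PR's actual test code): assert the metric's presence unconditionally so an unexpectedly missing metric fails the test instead of silently passing.

Optional<MetricData> metric = openTelemetryTesting.getMetrics().stream()
    .filter(m -> m.getName().equals(metricName))
    .findFirst();
// Fail loudly if the metric was never recorded, rather than skipping the checks.
assertThat(metric.isPresent()).isTrue();
MetricData data = metric.get();
// ... continue with the existing per-metric assertions on `data`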

@AgraVator requested a review from kannanjgithub on July 14, 2025 06:58
Comment on lines +352 to 357
if (info.isTransparentRetry()) {
transparentRetriesPerCall.incrementAndGet();
} else if (info.isHedging()) {
hedgedAttemptsPerCall.incrementAndGet();
} else {
attemptsPerCall.incrementAndGet();
Contributor Author

Should this not assume them to be mutually exclusive?

@@ -266,7 +266,8 @@ public ClientStreamTracer newClientStreamTracer(

Metadata newHeaders = updateHeaders(headers, previousAttemptCount);
// NOTICE: This set _must_ be done before stream.start() and it actually is.
sub.stream = newSubstream(newHeaders, tracerFactory, previousAttemptCount, isTransparentRetry);
sub.stream = newSubstream(newHeaders, tracerFactory, previousAttemptCount, isTransparentRetry,
isHedging);
Member

isHedging is just whether there is a hedging policy. It doesn't mean this particular stream is a hedge.

Contributor Author

So, are you suggesting changing the variable name to better reflect that?

@@ -71,6 +71,7 @@
*/
final class OpenTelemetryMetricsModule {
private static final Logger logger = Logger.getLogger(OpenTelemetryMetricsModule.class.getName());
private static final double NANOS_PER_SEC = 1_000_000_000.0;
Member

SECONDS_PER_NANO already exists. Use multiplication instead of division.
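A minimal sketch of the suggested change, assuming the module's existing SECONDS_PER_NANO constant (its value is shown here only for illustration):

private static final double SECONDS_PER_NANO = 1e-9; // stand-in for the constant the module already defines

private static double toSeconds(long nanos) {
  // Multiply by the existing constant instead of dividing by a new NANOS_PER_SEC.
  return nanos * SECONDS_PER_NANO;
}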


// Retry counts
if (module.resource.clientCallRetriesCounter() != null) {
long retriesPerCall = attemptsPerCall.get() - 1 >= 0 ? attemptsPerCall.get() - 1 : 0;
Member

Math.max(attemptsPerCall.get() - 1, 0)? Note that this currently is a bad idea, as it reads attemptsPerCall twice. (If it races, you can reason out that it is okay, but it is much better to avoid the race entirely by reading only once.)
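A sketch of the suggested pattern (illustrative only): read the atomic exactly once into a local, then clamp.

long attempts = attemptsPerCall.get();           // single read of the atomic
long retriesPerCall = Math.max(attempts - 1, 0); // clamp; no second read to race with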


// Hedge counts
if (module.resource.clientCallHedgesCounter() != null) {
if (hedgedAttemptsPerCall.get() > 0) {
Member

Read atomics (volatile and Atomic*) into a local variable instead of reading them multiple times.
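For example (illustrative only), the hedge-count block above could read the counter once:

long hedges = hedgedAttemptsPerCall.get(); // read once into a local
if (hedges > 0) {
  // record the per-call hedge count using the local `hedges` value
}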

@@ -331,6 +335,7 @@ public ClientStreamTracer newClientStreamTracer(StreamInfo info, Metadata metada
}
if (++activeStreams == 1 && attemptStopwatch.isRunning()) {
attemptStopwatch.stop();
Member

This attemptStopwatch looks to be for grpc.client.attempt.duration? This looks busted (pre-existing). It looks like attemptStopwatch is tracking the retry_delay right now, as it only runs when there are no streams. Each stream needs its own attempt stopwatch, right? But right now there is only one per call.

Contributor

CallAttemptsTracerFactory.attemptEnded records that a stream ended (StatsTraceContext.streamClosed calls it), so the attemptStopwatch is started at the end of the first attempt and of each retry attempt. It therefore measures the time between stream attempts, not a per-call duration.

But there is a problem: RetriableStream creates its own anonymous tracerFactory and does not take the one that MetricsClientInterceptor puts in the callOptions, so this needs to change.

I have a question. A call can only have one stream at a time, and the only way activeStreams can be > 1 is if the tracer factory is shared between calls (which would mess up the calculation of time between stream attempts anyway). OpenTelemetryMetricsModule.MetricsClientInterceptor creates the factory per call, though, not shared between calls. So we can never have more than one active stream, and there should be no need for synchronization either.

Contributor

What I said about RetriableStream's anonymous tracerFactory is not a problem: StatsTraceContext.streamClosed iterates over all tracers and calls streamClosed on each of them.
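A minimal sketch of the distinction discussed in this thread, using hypothetical names: the per-call attemptStopwatch only runs while no stream is active (so it measures retry delay), whereas each attempt's duration would come from a stopwatch owned by that attempt's own tracer.

import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;

final class AttemptTimer {
  // One stopwatch per stream/attempt, started when that attempt's tracer is created.
  private final Stopwatch perAttempt = Stopwatch.createStarted();

  long attemptNanos() {
    // Would back grpc.client.attempt.duration for this attempt only.
    return perAttempt.elapsed(TimeUnit.NANOSECONDS);
  }
}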

@@ -331,6 +335,7 @@ public ClientStreamTracer newClientStreamTracer(StreamInfo info, Metadata metada
}
if (++activeStreams == 1 && attemptStopwatch.isRunning()) {
attemptStopwatch.stop();
retryDelayNanos = attemptStopwatch.elapsed(TimeUnit.NANOSECONDS);
Member

I can't tell how retryDelayNanos is synchronized.
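One possible way to make the publication explicit (an illustration with hypothetical names, not the PR's actual fix): write retryDelayNanos while holding the same lock that guards activeStreams, and mark it volatile so the single read at call end is guaranteed to see the write.

final class CallDelayTracker {
  private final Object lock = new Object();
  private int activeStreams;
  private volatile long retryDelayNanos; // written under lock, read once when the call ends

  void streamStarted(long elapsedNanosSinceLastAttempt) {
    synchronized (lock) {
      if (++activeStreams == 1) {
        retryDelayNanos = elapsedNanosSinceLastAttempt;
      }
    }
  }

  long retryDelayNanos() {
    return retryDelayNanos; // volatile read; sufficient once the call has completed
  }
}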
