Implement otel retry metrics #12064

Open

AgraVator wants to merge 6 commits into master from implement-otel-retry-metric

Conversation

Contributor

@AgraVator commented May 14, 2025

Implements gRFC A96 (OpenTelemetry retry metrics).

@@ -64,8 +65,8 @@ public Stopwatch get() {
};

Contributor

Add unit tests.

Contributor Author

The gRFC has had recent updates; I'm waiting for it to be merged before making further changes.

Member

Yeah, there were discussions yesterday about the recent changes to the retry metrics and whether they helped. The implementation should be mostly the same, but it probably isn't worth chasing the gRFC at the moment.

@AgraVator force-pushed the implement-otel-retry-metric branch from a7a16ce to 3529769 on May 27, 2025 07:02
@AgraVator force-pushed the implement-otel-retry-metric branch from 3529769 to bd69ed5 on June 24, 2025 11:38
Optional<MetricData> metric = openTelemetryTesting.getMetrics().stream()
.filter(m -> m.getName().equals(metricName))
.findFirst();
if (metric.isPresent()) {
Contributor

When would this metric not be present for the operations the test performs on the tracer?

Contributor Author

Whenever they are not enabled or have default values

Contributor

As this is test code, it is risky to have the assertion inside an if statement. If the absence of the metric is not expected, then it should not cause the test to pass.
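A minimal sketch of the shape being suggested (the assertion style and helper names are illustrative, not the PR's actual test code): assert the metric's presence unconditionally so an unexpectedly missing metric fails the test instead of silently passing.

Optional<MetricData> metric = openTelemetryTesting.getMetrics().stream()
    .filter(m -> m.getName().equals(metricName))
    .findFirst();
// Fail loudly if the metric was never recorded, rather than skipping the checks.
assertThat(metric.isPresent()).isTrue();
MetricData data = metric.get();
// ... continue with the existing per-metric assertions on `data`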

@AgraVator requested a review from kannanjgithub on July 14, 2025 06:58
Comment on lines +352 to 357
if (info.isTransparentRetry()) {
transparentRetriesPerCall.incrementAndGet();
} else if (info.isHedging()) {
hedgedAttemptsPerCall.incrementAndGet();
} else {
attemptsPerCall.incrementAndGet();
Contributor Author

Should this not assume them to be mutually exclusive?

@@ -266,7 +266,8 @@ public ClientStreamTracer newClientStreamTracer(

Metadata newHeaders = updateHeaders(headers, previousAttemptCount);
// NOTICE: This set _must_ be done before stream.start() and it actually is.
sub.stream = newSubstream(newHeaders, tracerFactory, previousAttemptCount, isTransparentRetry);
sub.stream = newSubstream(newHeaders, tracerFactory, previousAttemptCount, isTransparentRetry,
isHedging);
Member

isHedging is just whether there is a hedging policy. It doesn't mean this particular stream is a hedge.

Contributor Author

So, are you suggesting changing the variable name to better reflect that?

@@ -71,6 +71,7 @@
*/
final class OpenTelemetryMetricsModule {
private static final Logger logger = Logger.getLogger(OpenTelemetryMetricsModule.class.getName());
private static final double NANOS_PER_SEC = 1_000_000_000.0;
Member

SECONDS_PER_NANO already exists. Use multiplication instead of division.
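A minimal sketch of the suggested change, assuming the module's existing SECONDS_PER_NANO constant (its value is shown here only for illustration):

private static final double SECONDS_PER_NANO = 1e-9; // stand-in for the constant the module already defines

private static double toSeconds(long nanos) {
  // Multiply by the existing constant instead of dividing by a new NANOS_PER_SEC.
  return nanos * SECONDS_PER_NANO;
}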


// Retry counts
if (module.resource.clientCallRetriesCounter() != null) {
long retriesPerCall = attemptsPerCall.get() - 1 >= 0 ? attemptsPerCall.get() - 1 : 0;
Member

Math.max(attemptsPerCall.get() - 1, 0)? Note that this currently is a bad idea, as it reads attemptsPerCall twice. (If it races, you can reason out that it is okay, but it is much better to avoid the race entirely by reading only once.)
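A sketch of the suggested pattern (illustrative only): read the atomic exactly once into a local, then clamp.

long attempts = attemptsPerCall.get();           // single read of the atomic
long retriesPerCall = Math.max(attempts - 1, 0); // clamp; no second read to race with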


// Hedge counts
if (module.resource.clientCallHedgesCounter() != null) {
if (hedgedAttemptsPerCall.get() > 0) {
Member

Read atomics (volatile and Atomic*) into a local variable instead of reading them multiple times.
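For example (illustrative only), the hedge-count block above could read the counter once:

long hedges = hedgedAttemptsPerCall.get(); // read once into a local
if (hedges > 0) {
  // record the per-call hedge count using the local `hedges` value
}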

@@ -331,6 +335,7 @@ public ClientStreamTracer newClientStreamTracer(StreamInfo info, Metadata metada
}
if (++activeStreams == 1 && attemptStopwatch.isRunning()) {
attemptStopwatch.stop();
Member

This attemptStopwatch looks to be for grpc.client.attempt.duration? This looks busted (pre-existing). It looks like attemptStopwatch is tracking the retry_delay right now, as it only runs when there are no streams. Each stream needs its own attempt stopwatch, right? But right now there is only one per call.

Contributor

CallAttemptsTracerFactory.attemptEnded records that a stream ended (StatsTraceContext.streamClosed calls it), so the attemptStopwatch is started at the end of the first attempt and of each retry attempt. It therefore measures the time between stream attempts, not a per-call duration.

But there is a problem: RetriableStream creates its own anonymous tracerFactory and does not take the one that MetricsClientInterceptor puts in the callOptions, so this needs to change.

I have a question. A call can only have one stream at a time, and the only way activeStreams can be > 1 is if the tracer factory is shared between calls (which would mess up the calculation of time between stream attempts anyway). OpenTelemetryMetricsModule.MetricsClientInterceptor creates the factory per call, though, not shared between calls. So we can never have more than one active stream, and there should be no need for synchronization either.

Contributor

What I said about RetriableStream's anonymous tracerFactory is not a problem: StatsTraceContext.streamClosed iterates over all tracers and calls streamClosed on each of them.
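A minimal sketch of the distinction discussed in this thread, using hypothetical names: the per-call attemptStopwatch only runs while no stream is active (so it measures retry delay), whereas each attempt's duration would come from a stopwatch owned by that attempt's own tracer.

import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;

final class AttemptTimer {
  // One stopwatch per stream/attempt, started when that attempt's tracer is created.
  private final Stopwatch perAttempt = Stopwatch.createStarted();

  long attemptNanos() {
    // Would back grpc.client.attempt.duration for this attempt only.
    return perAttempt.elapsed(TimeUnit.NANOSECONDS);
  }
}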

@@ -331,6 +335,7 @@ public ClientStreamTracer newClientStreamTracer(StreamInfo info, Metadata metada
}
if (++activeStreams == 1 && attemptStopwatch.isRunning()) {
attemptStopwatch.stop();
retryDelayNanos = attemptStopwatch.elapsed(TimeUnit.NANOSECONDS);
Member

I can't tell how retryDelayNanos is synchronized.
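One possible way to make the publication explicit (an illustration with hypothetical names, not the PR's actual fix): write retryDelayNanos while holding the same lock that guards activeStreams, and mark it volatile so the single read at call end is guaranteed to see the write.

final class CallDelayTracker {
  private final Object lock = new Object();
  private int activeStreams;
  private volatile long retryDelayNanos; // written under lock, read once when the call ends

  void streamStarted(long elapsedNanosSinceLastAttempt) {
    synchronized (lock) {
      if (++activeStreams == 1) {
        retryDelayNanos = elapsedNanosSinceLastAttempt;
      }
    }
  }

  long retryDelayNanos() {
    return retryDelayNanos; // volatile read; sufficient once the call has completed
  }
}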
