Otel Java Agent Causing Heap Memory Leak Issue #12303
Hello, I have experienced a similar issue and have been troubleshooting it myself. I've compiled my resolution experience in the hope that it may be helpful. Like you, I encountered the problem where memory was continuously allocated to the Old Gen during batch processing and was not being released.

Situation: In my case, I collected additional Spans and Attributes through the Agent's Extension, aside from the Spans automatically instrumented by the Java Agent. As data collection went on, the leak gradually grew, and after the server went into production, most of the allocated heap was occupied by the Old Gen, causing the server to crash with an OutOfMemoryError (OOM). Although the issue seemed temporarily resolved by garbage collection (GC), newly created Spans were immediately allocated to the Old Gen.

Cause: The cause and solution I identified are as follows (the issue was resolved in my case, but the accuracy is not guaranteed, so you should verify it yourself). If you observe a continuous increase in the Old Gen in the metrics you are collecting, it might be a similar case to mine.

I hope this helps you resolve your issue quickly.
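The Old Gen growth pattern this commenter describes can be watched from inside the JVM with the standard `java.lang.management` API. The sketch below is not from the thread; it is a minimal, illustrative example, and the pool-name matching is an assumption (names like "G1 Old Gen", "PS Old Gen", or "Tenured Gen" depend on the collector):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class OldGenMonitor {
    /** Old Gen used/max ratio in [0, 1], or -1 if no matching pool is found. */
    static double oldGenUsageRatio() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Pool names are collector-dependent: "G1 Old Gen", "PS Old Gen", "Tenured Gen", ...
            if (name.contains("Old Gen") || name.contains("Tenured")) {
                MemoryUsage usage = pool.getUsage();
                // getMax() may be -1 (undefined); fall back to the committed size.
                long max = usage.getMax() > 0 ? usage.getMax() : usage.getCommitted();
                return max > 0 ? (double) usage.getUsed() / max : 0.0;
            }
        }
        return -1.0; // some collectors expose no pool matching these names
    }

    public static void main(String[] args) {
        System.out.printf("Old Gen usage: %.1f%%%n", 100 * oldGenUsageRatio());
    }
}
```

A ratio that keeps climbing across full GCs, rather than dropping back in a sawtooth, is the signal described above.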
@vanilla-sundae thanks for reporting; unfortunately, the information provided is not enough to understand and fix the issue. You should examine the heap dump and try to answer the following.
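For anyone following along: the heap dump laurit asks for can be captured with `jmap`, or programmatically via the HotSpot diagnostic MXBean. This is a generic sketch, not from the thread; the output path is arbitrary but must end in `.hprof` and must not already exist:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    /**
     * Writes an .hprof heap dump to the given path.
     * live=true forces a full GC first, so only reachable objects are dumped —
     * useful here, since the question is what survives GC.
     */
    public static void dump(String path, boolean live) throws IOException {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, live);
    }

    public static void main(String[] args) throws IOException {
        dump("leak-suspect.hprof", true);
    }
}
```

The resulting file can then be opened in Eclipse MAT or a similar tool to inspect incoming references.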
This has been automatically marked as stale because it has been marked as needing author feedback and has not had any activity for 7 days. It will be closed automatically if there is no response from the author within 7 additional days from this comment.
Hi @laurit, thanks for your response.

We are seeing a large number of [object names lost from the original post].

The incoming references for [screenshot lost from the original post].

Also seeing some error messages [screenshot lost from the original post].

I suspect those objects are promoted to the Old Gen prematurely, because I'm seeing our Old Gen Space usage metrics increase before our heap memory increases.
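One plausible reading of the `AbstractWeakConcurrentMap$WeakKey` buildup: a weak-keyed map can only release an entry after its key becomes unreachable, so accumulating weak keys usually means the keys (or whatever holds them, e.g. a never-completed task keeping a `PropagatedContext` alive) are still strongly referenced. The JDK's `WeakHashMap` shows the same mechanic; this is an analogy only, not the agent's actual map class:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakKeyDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Object, String> cache = new WeakHashMap<>();
        Object key = new Object();
        cache.put(key, "value");

        // While 'key' is strongly reachable, GC cannot clear the entry.
        System.gc();
        Thread.sleep(100);
        System.out.println("before: " + cache.size()); // prints "before: 1"

        key = null; // drop the last strong reference to the key
        // GC timing is not guaranteed; poll until the entry is expunged.
        for (int i = 0; i < 50 && !cache.isEmpty(); i++) {
            System.gc();
            Thread.sleep(50);
        }
        System.out.println("after: " + cache.size()); // typically 0 once the key is collected
    }
}
```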
You can ignore these; they look like string literals used in a class.

Based on the first screenshot I can see that you have at least 2 large maps. Are there more? Based on the second screenshot I know what the first map is. Could you also look at the incoming references for the second map, and check whether it has a reference to a class with a really long name starting with [name lost from the original post]?
I saw pull request #12397 linking this issue. Is this potentially a fix for this issue? We are using the AWS Distro https://github.com/aws-observability/aws-otel-java-instrumentation, which only uses the version [lost from the original post].
It would be great if you could download the SNAPSHOT from that PR and give it a test (https://github.com/open-telemetry/opentelemetry-java-instrumentation/actions/runs/11175637247/artifacts/2014579460).
yes
Thanks for the quick response. Our service has a CI/CD pipeline set up, and we have to deploy to the production environment to verify. It might take several days to get the result, but I'll post here as soon as I'm able to deploy and see how it performs.
@vanilla-sundae #12397 is not enough to fix your issue.
Got it, thanks! Do you think it would help if we add this [change referenced in the original post]? When should we expect the fix to be released? Also let me know if you need anything else from me for your investigation.
Describe the bug

Context

My service uses the Otel Java agent published by this library https://github.com/aws-observability/aws-otel-java-instrumentation, with the annotations `@WithSpan` and `@SpanAttribute` (https://opentelemetry.io/docs/zero-code/java/agent/annotations/) in the code to get traces for our requests.

Problem Statement
The Otel Java agent was set up correctly, and there was no memory issue with the initial setup. However, after we added the annotations `@WithSpan` and `@SpanAttribute` to the service code, we started to see a periodic memory increase (the JVM metric `HeapMemoryAfterGCUse` increased to almost 100%) with a lot of Otel objects created on the heap, and we have to bounce our hosts to mitigate it.

The Otel objects we saw are mainly `io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap$WeakKey` and `io.opentelemetry.javaagent.bootstrap.executors.PropagatedContext`, as well as the Java objects `java.util.concurrent.ConcurrentHashMap$Node` and `java.lang.ref.WeakReference`.
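For context, the annotation usage described in this report looks roughly like the sketch below. The class, method, and attribute names are hypothetical, not from the reporter's code; the annotations come from the real `io.opentelemetry.instrumentation.annotations` package and only take effect when the agent is attached:

```java
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OrderProcessor {
    // With the agent attached, this method is wrapped in a span named
    // "OrderProcessor.process", and the annotated argument is recorded
    // as the span attribute "order.id".
    @WithSpan
    public String process(@SpanAttribute("order.id") String orderId) {
        return "processed " + orderId;
    }

    public static void main(String[] args) throws Exception {
        OrderProcessor processor = new OrderProcessor();
        // As in the report: the annotated method runs on virtual threads (JDK 21),
        // so context propagation across the executor is involved.
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            System.out.println(pool.submit(() -> processor.process("42")).get());
        }
    }
}
```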
We added `@WithSpan` to methods executed by child threads and virtual threads; we're not sure if that is a concern, but we are able to view traces for these methods correctly.

Here's our heap dump result:
Histogram:
Memory Leak Suspect Report:
Ask
Can anyone help with this issue and let us know what the root cause could be?
Steps to reproduce

We set up the Java agent in our service Docker image file:

And we add `@WithSpan` to methods and `@SpanAttribute` to one of the arguments.

Expected behavior
No or minimal impact on heap memory usage.
Actual behavior
Heap memory usage after GC increases to 100% if we don't bounce the hosts.
Javaagent or library instrumentation version
v1.32.3
Environment
JDK: JDK21
OS: Linux x86_64
Additional context
No response