Otel Java Agent Causing Heap Memory Leak Issue #12303

Closed
vanilla-sundae opened this issue Sep 20, 2024 · 11 comments · Fixed by #12444
Labels
bug (Something isn't working) · needs triage (New issue that requires triage)

Comments

@vanilla-sundae

Describe the bug

Context

My service uses the OTel Java agent published by https://github.com/aws-observability/aws-otel-java-instrumentation, with the @WithSpan and @SpanAttribute annotations (https://opentelemetry.io/docs/zero-code/java/agent/annotations/) in the code to get traces for our requests.

Problem Statement

The OTel Java agent was set up correctly, and there was no memory issue with the initial setup. However, after we added the @WithSpan and @SpanAttribute annotations to the service code, we started to see a periodic memory increase (the JVM metric HeapMemoryAfterGCUse climbs to almost 100%) with a lot of OTel objects created on the heap, and we have to bounce our hosts to mitigate it.

The OTel objects we see are mainly io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap$WeakKey and io.opentelemetry.javaagent.bootstrap.executors.PropagatedContext, along with the JDK objects java.util.concurrent.ConcurrentHashMap$Node and java.lang.ref.WeakReference.

We added @WithSpan to methods executed by child threads and virtual threads; we are not sure whether that is a concern, but we are able to view traces for these methods correctly.

Here's our heap dump result:

Histogram:
[screenshot]

Memory Leak Suspect Report:
[screenshot]
[screenshot]
[screenshot]

Ask

Can anyone help with this issue and let us know what the root cause could be?

Steps to reproduce

We set up the Java agent in our service's Dockerfile:

ADD https://github.com/aws-observability/aws-otel-java-instrumentation/releases/latest/download/aws-opentelemetry-agent.jar /opt/aws-opentelemetry-agent.jar
RUN chmod 644 /opt/aws-opentelemetry-agent.jar
ENV JAVA_TOOL_OPTIONS="-javaagent:/opt/aws-opentelemetry-agent.jar"
ENV OTEL_RESOURCE_ATTRIBUTES="service.name=XXX,service.namespace=XXX"
ENV OTEL_PROPAGATORS="tracecontext,baggage,xray"
ENV OTEL_TRACES_SAMPLER="traceidratio"
ENV OTEL_TRACES_SAMPLER_ARG="0.00001"
ENV OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"

And we add @WithSpan to methods and @SpanAttribute to one of their arguments:

import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;

@WithSpan
public void myMethod(@SpanAttribute SomeClass someObject) {
    // <...>
}
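
For reference, here is a minimal, self-contained sketch of the pattern described above: the annotated method invoked both from tasks submitted to a custom ThreadPoolExecutor subclass and from virtual threads. Only the annotations, the ApplicationThreadExecutorService name, and myMethod come from our actual setup; the class layout, the SomeClass placeholder, the pool sizes, and the example class name are illustrative assumptions.

import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class WithSpanFromThreadsExample {

    // Placeholder for the argument type used in the real service.
    record SomeClass(String id) {}

    // Annotated method, as in the snippet above.
    @WithSpan
    public void myMethod(@SpanAttribute SomeClass someObject) {
        // business logic elided
    }

    // Stand-in for our customized executor, which extends ThreadPoolExecutor.
    static class ApplicationThreadExecutorService extends ThreadPoolExecutor {
        ApplicationThreadExecutorService() {
            super(4, 4, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        WithSpanFromThreadsExample service = new WithSpanFromThreadsExample();

        // Child threads from the custom pool: the agent's executors instrumentation
        // attaches the current context to each submitted task, which is consistent
        // with the PropagatedContext entries we see in the heap dump.
        ExecutorService pool = new ApplicationThreadExecutorService();
        for (int i = 0; i < 10; i++) {
            int n = i;
            pool.submit(() -> service.myMethod(new SomeClass("task-" + n)));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);

        // Virtual threads (JDK 21) invoking the same annotated method.
        try (ExecutorService virtualThreads = Executors.newVirtualThreadPerTaskExecutor()) {
            virtualThreads.submit(() -> service.myMethod(new SomeClass("virtual")));
        }
    }
}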

Expected behavior

No or minimal impact on heap memory usage.

Actual behavior

Heap memory usage after GC increases to nearly 100% if we don't bounce the hosts.

Javaagent or library instrumentation version

v1.32.3

Environment

JDK: JDK21
OS: Linux x86_64

Additional context

No response

@vanilla-sundae added the bug (Something isn't working) and needs triage (New issue that requires triage) labels on Sep 20, 2024
@leon4652

leon4652 commented Sep 20, 2024

Hello,

I have experienced a similar issue and have been troubleshooting it myself. I’ve compiled my resolution experience in the hope that it may be helpful. Like you, I encountered the problem where memory was continuously allocated to the Old Gen during batch processing and was not being released.

Situation

In my case, I collected additional Spans and Attributes through the Agent’s Extension, aside from the Spans automatically instrumented by the Java Agent. As data collection occurred, memory leakage gradually increased, and after the server went into production, most of the allocated heap was occupied by the Old Gen, causing the server to crash with an OutOfMemoryError (OOM). Although it seemed like the issue was resolved temporarily by garbage collection (GC), newly generated Spans were immediately allocated to the Old Gen.

Cause

The cause and solution I identified are as follows (the issue was resolved in my case, but the accuracy is not guaranteed, so you should verify it yourself). If you observe a continuous increase in the Old Gen in the metrics you are collecting, it might be a similar case to mine.

  1. The size of the 'Span' objects collected by the SpanProcessor had grown significantly, and the Spans stored between batch cycles were causing the issue.
  2. The JVM’s Eden Space was unable to accommodate these objects, leading to early promotion, which resulted in memory leakage.
  3. This issue likely arose because of the large size of the collected Spans. (In my case, when converted to JSON, each Span was approximately 100KB or larger.) The memory allocated to Eden was insufficient to handle this, causing premature promotion to the Old Gen.
  4. I recommend reviewing the data collection associated with this, especially if you are using @WithSpan.

I hope this helps you resolve your issue quickly.
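
As a side note, if oversized spans turn out to be the cause for you as well, the SDK's span limits can cap attribute sizes. This is a minimal sketch in the same ENV style as your Dockerfile, assuming the standard SDK autoconfigure variables and using arbitrary values; please verify the exact names against the SDK documentation for your agent version.

ENV OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT="1024"
ENV OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT="64"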

@laurit
Contributor

laurit commented Sep 20, 2024

@vanilla-sundae thanks for reporting, unfortunately the information provided is not enough to understand and fix the issue. You should examine the heap dump and try to answer the following:

  • which map it is that is large
  • what kind of keys it has
  • why there are so many of them
  • whether the keys can be collected
  • if not, what keeps them alive

If you are not able to analyze the heap dump yourself, you could turn to your OTel vendor and see if they can help you. Alternatively, if you could provide a minimal application that reproduces the issue, that would also be great.

@laurit added the needs author feedback (Waiting for additional feedback from the author) label on Sep 24, 2024
Contributor

github-actions bot commented Oct 1, 2024

This has been automatically marked as stale because it has been marked as needing author feedback and has not had any activity for 7 days. It will be closed automatically if there is no response from the author within 7 additional days from this comment.

@github-actions bot added the stale label on Oct 1, 2024
@vanilla-sundae
Author

vanilla-sundae commented Oct 1, 2024

Hi @laurit thanks for your response.

We are seeing a large number of java.util.concurrent.ConcurrentHashMap objects in the heap holding keys of type io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap$WeakKey and values of type io.opentelemetry.javaagent.bootstrap.executors.PropagatedContext.

[screenshot]

The incoming references for the ConcurrentHashMap mainly come from our customized executor class ApplicationThreadExecutorService, which extends ThreadPoolExecutor.

image

Also seeing some error messages
Failed to propagate context because previous propagated context is already set; new: {0}, old: {1}
java.lang.String - nanosecond timeout value out of range
java.lang.String - timeoutMillis value is negative

[screenshot]

[screenshot]

I suspect those objects are promoted to Old Gen prematurely, because I'm seeing our Old Gen space usage metrics increase before our overall heap memory usage increases.

[screenshot]
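
One way we could try to confirm the premature promotion is GC logging with tenuring-age details, for example by extending the JAVA_TOOL_OPTIONS line in the Dockerfile above. This is only a sketch; the log path and decorators are illustrative, not our actual setup.

ENV JAVA_TOOL_OPTIONS="-javaagent:/opt/aws-opentelemetry-agent.jar -Xlog:gc*,gc+age=trace:file=/var/log/gc.log:time,uptime,level,tags"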

@github-actions bot removed the needs author feedback (Waiting for additional feedback from the author) and stale labels on Oct 1, 2024
@laurit
Contributor

laurit commented Oct 2, 2024

Also seeing some error messages
Failed to propagate context because previous propagated context is already set; new: {0}, old: {1}
java.lang.String - nanosecond timeout value out of range
java.lang.String - timeoutMillis value is negative

You can ignore these; they look like string literals used in a class.

We are seeing a large number of java.util.concurrent.ConcurrentHashMap objects

Based on the first screenshot I can see that you have at least 2 large maps. Are there more? Based on the second screenshot I know what the first map is. Could you also look at the incoming references for the second map? If it has a reference to a class with a really long name starting with io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$..., similar to what is at the bottom of the second screenshot, then knowing that name would help.

@vanilla-sundae
Author

vanilla-sundae commented Oct 4, 2024

Based on the first screenshot I can see that you have at least 2 large maps. Are there more?

In the dominator tree, there are only 2 large maps. But if I go to the "top consumer" section and list the "Biggest Top-Level Dominator Packages", there are 28 dominator maps. See the screenshot below.

[screenshot]

Could you also look at the incoming references for the second map? If it has a reference to a class with a really long name starting with io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$..., similar to what is at the bottom of the second screenshot, then knowing that name would help.

Here's the screenshot of the incoming references for the second map:
[screenshot]

And for the first map, here is a better screenshot of that field:
[screenshot]

@vanilla-sundae
Author

I saw pull request #12397, which links to this issue. Is this potentially a fix for this issue?

We are using the AWS Distro (https://github.com/aws-observability/aws-otel-java-instrumentation), which only uses version v1.32.3. After this is merged, does it require them to update the version they are using?

@trask
Member

trask commented Oct 4, 2024

Is this potentially a fix for this issue?

it would be great if you could download the SNAPSHOT from that PR and give it a test (https://github.com/open-telemetry/opentelemetry-java-instrumentation/actions/runs/11175637247/artifacts/2014579460)

After this is merged, does it require them to update the version they are using?

yes

@vanilla-sundae
Author

Thanks for the quick response. Our service has a CI/CD pipeline set up and we have to deploy to the production environment to verify. It might take several days to get the result, but I'll post here as soon as I'm able to deploy and see how it performs.

@laurit
Contributor

laurit commented Oct 5, 2024

@vanilla-sundae #12397 is not enough to fix your issue.

@vanilla-sundae
Author

vanilla-sundae commented Oct 7, 2024

Got it, thanks! Do you think it would help if we add the executors instrumentation config for our Java agent?
https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/d646ad6270f57e12a71f59f1b3c35bab6e9b154f/instrumentation/executors
I'm asking because I'm seeing that the incoming references for the ConcurrentHashMap mainly come from our customized executor class ApplicationThreadExecutorService, which extends ThreadPoolExecutor.
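
If that is the right knob, I imagine it would look something like this in our Dockerfile. I'm assuming the otel.instrumentation.executors.include property from that instrumentation's docs applies here, and the package name for our executor class below is just a placeholder.

ENV JAVA_TOOL_OPTIONS="-javaagent:/opt/aws-opentelemetry-agent.jar -Dotel.instrumentation.executors.include=com.example.ApplicationThreadExecutorService"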

When should we expect the fix to be released? Also let me know if you need anything else from me for your investigation.
