Otel Java Agent Causing Heap Memory Leak Issue #12303
Hello, I have experienced a similar issue and have been troubleshooting it myself. I've compiled my resolution experience in the hope that it may be helpful. Like you, I encountered the problem where memory was continuously allocated to the Old Gen during batch processing and was not being released.

Situation: In my case, I collected additional Spans and Attributes through the Agent's Extension, aside from the Spans automatically instrumented by the Java Agent. As data collection went on, the leak gradually grew, and after the server went into production, most of the allocated heap was occupied by the Old Gen, causing the server to crash with an OutOfMemoryError (OOM). Although the issue seemed temporarily resolved by garbage collection (GC), newly created Spans were immediately allocated to the Old Gen.

Cause: The cause and solution I identified are as follows (the issue was resolved in my case, but the accuracy is not guaranteed, so you should verify it yourself). If you observe a continuous increase in the Old Gen in the metrics you are collecting, it might be a similar case to mine.

I hope this helps you resolve your issue quickly.
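The Old Gen growth pattern this commenter describes can be watched from inside the JVM with the standard `java.lang.management` API. The sketch below is not from the thread; it is a minimal, illustrative example, and the pool-name matching is an assumption (names like "G1 Old Gen", "PS Old Gen", or "Tenured Gen" depend on the collector):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class OldGenMonitor {
    /** Old Gen used/max ratio in [0, 1], or -1 if no matching pool is found. */
    static double oldGenUsageRatio() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Pool names are collector-dependent: "G1 Old Gen", "PS Old Gen", "Tenured Gen", ...
            if (name.contains("Old Gen") || name.contains("Tenured")) {
                MemoryUsage usage = pool.getUsage();
                // getMax() may be -1 (undefined); fall back to the committed size.
                long max = usage.getMax() > 0 ? usage.getMax() : usage.getCommitted();
                return max > 0 ? (double) usage.getUsed() / max : 0.0;
            }
        }
        return -1.0; // some collectors expose no pool matching these names
    }

    public static void main(String[] args) {
        System.out.printf("Old Gen usage: %.1f%%%n", 100 * oldGenUsageRatio());
    }
}
```

A ratio that keeps climbing across full GCs, rather than dropping back in a sawtooth, is the signal described above.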
@vanilla-sundae thanks for reporting; unfortunately, the information provided is not enough to understand and fix the issue. You should examine the heap dump and try to answer the following.
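For anyone following along: the heap dump laurit asks for can be captured with `jmap`, or programmatically via the HotSpot diagnostic MXBean. This is a generic sketch, not from the thread; the output path is arbitrary but must end in `.hprof` and must not already exist:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    /**
     * Writes an .hprof heap dump to the given path.
     * live=true forces a full GC first, so only reachable objects are dumped —
     * useful here, since the question is what survives GC.
     */
    public static void dump(String path, boolean live) throws IOException {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, live);
    }

    public static void main(String[] args) throws IOException {
        dump("leak-suspect.hprof", true);
    }
}
```

The resulting file can then be opened in Eclipse MAT or a similar tool to inspect incoming references.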
This has been automatically marked as stale because it has been marked as needing author feedback and has not had any activity for 7 days. It will be closed automatically if there is no response from the author within 7 additional days from this comment.
Hi @laurit, thanks for your response.

We are seeing a large number of [object names lost from the original post].

The incoming references for [screenshot lost from the original post].

Also seeing some error messages [screenshot lost from the original post].

I suspect those objects are promoted to the Old Gen prematurely, because I'm seeing our Old Gen Space usage metrics increase before our heap memory increases.
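One plausible reading of the `AbstractWeakConcurrentMap$WeakKey` buildup: a weak-keyed map can only release an entry after its key becomes unreachable, so accumulating weak keys usually means the keys (or whatever holds them, e.g. a never-completed task keeping a `PropagatedContext` alive) are still strongly referenced. The JDK's `WeakHashMap` shows the same mechanic; this is an analogy only, not the agent's actual map class:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakKeyDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Object, String> cache = new WeakHashMap<>();
        Object key = new Object();
        cache.put(key, "value");

        // While 'key' is strongly reachable, GC cannot clear the entry.
        System.gc();
        Thread.sleep(100);
        System.out.println("before: " + cache.size()); // prints "before: 1"

        key = null; // drop the last strong reference to the key
        // GC timing is not guaranteed; poll until the entry is expunged.
        for (int i = 0; i < 50 && !cache.isEmpty(); i++) {
            System.gc();
            Thread.sleep(50);
        }
        System.out.println("after: " + cache.size()); // typically 0 once the key is collected
    }
}
```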
You can ignore these; they look like string literals used in a class.

Based on the first screenshot I can see that you have at least 2 large maps. Are there more? Based on the second screenshot I know what the first map is. Could you also look at the incoming references for the second map, and check whether it has a reference to a class with a really long name starting with [name lost from the original post]?
I saw pull request #12397 linking this issue. Is this potentially a fix for this issue? We are using the AWS Distro https://github.com/aws-observability/aws-otel-java-instrumentation, which only uses the version [lost from the original post].
It would be great if you could download the SNAPSHOT from that PR and give it a test (https://github.com/open-telemetry/opentelemetry-java-instrumentation/actions/runs/11175637247/artifacts/2014579460).
yes
Thanks for the quick response. Our service has a CI/CD pipeline set up, and we have to deploy to the production environment to verify. It might take several days to get the result, but I'll post here as soon as I'm able to deploy and see how it performs.
@vanilla-sundae #12397 is not enough to fix your issue.
Got it, thanks! Do you think it would help if we add this [change referenced in the original post]? When should we expect the fix to be released? Also let me know if you need anything else from me for your investigation.
Describe the bug

Context

My service uses the Otel Java agent published by this library https://github.com/aws-observability/aws-otel-java-instrumentation, with the annotations `@WithSpan` and `@SpanAttribute` (https://opentelemetry.io/docs/zero-code/java/agent/annotations/) in the code to get traces for our requests.

Problem Statement
The Otel Java agent was set up correctly, and there was no memory issue with the initial setup. However, after we added the annotations `@WithSpan` and `@SpanAttribute` to the service code, we started to see a periodic memory increase (the JVM metric `HeapMemoryAfterGCUse` increased to almost 100%) with a lot of Otel objects created on the heap, and we have to bounce our hosts to mitigate it.

The Otel objects we saw are mainly `io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap$WeakKey` and `io.opentelemetry.javaagent.bootstrap.executors.PropagatedContext`, as well as the Java objects `java.util.concurrent.ConcurrentHashMap$Node` and `java.lang.ref.WeakReference`.
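For context, the annotation usage described in this report looks roughly like the sketch below. The class, method, and attribute names are hypothetical, not from the reporter's code; the annotations come from the real `io.opentelemetry.instrumentation.annotations` package and only take effect when the agent is attached:

```java
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OrderProcessor {
    // With the agent attached, this method is wrapped in a span named
    // "OrderProcessor.process", and the annotated argument is recorded
    // as the span attribute "order.id".
    @WithSpan
    public String process(@SpanAttribute("order.id") String orderId) {
        return "processed " + orderId;
    }

    public static void main(String[] args) throws Exception {
        OrderProcessor processor = new OrderProcessor();
        // As in the report: the annotated method runs on virtual threads (JDK 21),
        // so context propagation across the executor is involved.
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            System.out.println(pool.submit(() -> processor.process("42")).get());
        }
    }
}
```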
We added `@WithSpan` to methods executed by child threads and virtual threads; we're not sure if that is a concern, but we are able to view traces for these methods correctly.

Here's our heap dump result:
Histogram:
Memory Leak Suspect Report:
Ask
Can anyone help with this issue and let us know what the root cause could be?
Steps to reproduce

We set up the Java agent in our service Docker image file:

And we add `@WithSpan` to methods and `@SpanAttribute` to one of the arguments.

Expected behavior
No or minimal impact on heap memory usage.
Actual behavior
Heap memory usage after GC increases to 100% if we don't bounce the hosts.
Javaagent or library instrumentation version
v1.32.3
Environment
JDK: JDK21
OS: Linux x86_64
Additional context
No response