We started getting pod OutOfMemory errors every 2-3 days. To debug, we took a heap dump and found that co.elastic.apm.agent.profiler.CallTree objects (about 2 million instances) alone were consuming more than 2.5 GB of heap space, and eventually the pod was going OutOfMemory.
co.elastic.apm.agent.profiler.CallTree [2 million objects]
Could you please help me with this? Let me know if you need any more information.
Thanks for reporting!
Yes, some more info may help:
Please take a look at the heap graphs in our APM metrics (JVMs) view over the entire period leading up to the OOM error - does it look like a leak, i.e. continuously increasing heap usage until the heap is depleted? Or is it a spike in heap usage shortly before the crash?
If possible, please provide such a heap dump for analysis.
Provide as many details as possible about these pods - exact OS, exact JVM version, etc.
If there are any other configurations you set through env variables or otherwise, please provide them as well.
Until we get to the bottom of this, you can avoid these crashes by running your app without setting the -Delastic.apm.profiling_inferred_spans_enabled config (i.e. leaving inferred-spans profiling at its disabled default).
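For example, if the property is currently being passed on the JVM command line, dropping it (or explicitly setting it to false) keeps inferred-spans profiling off - a minimal sketch, assuming the agent is attached via -javaagent and using placeholder paths:

    java -javaagent:/path/to/elastic-apm-agent.jar \
         -Delastic.apm.profiling_inferred_spans_enabled=false \
         -jar my-app.jar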
Also, please upgrade to the latest agent version so we know we are not looking into something that is already irrelevant.
Unfortunately it would not be possible to share the heap dump for analysis. I am sharing a screenshot of the analysis instead. Please let me know what steps need to be followed in the analysis - I will do them and provide you the info.
Definitely looks like a memory leak
I will look into that. In the meantime, can you analyze the heap references and see what holds all these CallTree objects (which collection in which class)?
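For reference, one common way to do this - a sketch only, with placeholder PID and paths, assuming a JDK with jmap is available in the pod - is to capture a heap dump and then follow the reference chains in a heap analyzer such as Eclipse MAT:

    # capture a heap dump of the running JVM (live objects only)
    jmap -dump:live,format=b,file=/tmp/app-heap.hprof <pid>

    # open /tmp/app-heap.hprof in Eclipse MAT, find the CallTree instances in the
    # histogram, and use "Merge Shortest Paths to GC Roots" to see which field or
    # collection is retaining them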
Thanks!
elastic-apm-sampling-profiler
at jdk.internal.misc.Unsafe.park(ZJ)V (Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(J)V (LockSupport.java:357)
at co.elastic.apm.agent.profiler.SamplingProfiler.consumeActivationEventsFromRingBufferAndWriteToFile(Lco/elastic/apm/agent/configuration/converter/TimeDuration;)V (SamplingProfiler.java:395)
at co.elastic.apm.agent.profiler.SamplingProfiler.profile(Lco/elastic/apm/agent/configuration/converter/TimeDuration;Lco/elastic/apm/agent/configuration/converter/TimeDuration;)V (SamplingProfiler.java:346)
at co.elastic.apm.agent.profiler.SamplingProfiler.run()V (SamplingProfiler.java:317)
at java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object; (Executors.java:515)
at java.util.concurrent.FutureTask.run()V (FutureTask.java:264)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V (ScheduledThreadPoolExecutor.java:304)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:628)
at java.lang.Thread.run()V (Thread.java:834)
@Ayush_Agrahari thanks for the additional info.
Any chance you can answer my questions as well?
One of them was how reproducible this is. I ran some long load tests on an Oracle JDK and have not been able to reproduce it so far. If it is easily reproducible in your Ubuntu setup, we have a good place to test it. Can you please try this snapshot and see if it resolves the problem? I did some initial analysis of the code and made some small changes to object recycling.
No - we are getting this error repeatedly, every 3-4 days.
We redeploy the pod, which means the old pod is destroyed and a new pod with a fresh JVM is deployed.
We get this issue every 2 to 3 days. For now, as per your suggestion, we have disabled profiling and are no longer hitting it. We can reproduce it in our system; it takes 2-3 days.
Nothing as such. During the last 1-3 hours before the OOM, the thread count increased by 5-10%.
This would be extremely useful! Please re-enable the sampling profiler on one (or a subset) of your pods, but use the agent snapshot provided above. Set the JVM args to produce the most detailed heap dump possible if and when it crashes.
If this agent snapshot fixes the issue - great. If not, please try again to analyse your heap dump as you did before, to figure out exactly which field/collection of the SamplingProfiler instance is holding on to these huge CallTree$Root objects (assuming you cannot provide the heap dump itself).
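For the heap dump JVM args, something like the following should write a full .hprof file when the OOM happens - a sketch with a placeholder dump path:

    java -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/var/dumps \
         -javaagent:/path/to/elastic-apm-agent.jar \
         -jar my-app.jar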
@Ayush_Agrahari one last ping - we are about to merge the related PR that contains the proposed fix, so it would be nice to get your feedback on it.
Thanks
@Eyal_Koren Apologies for the late reply. We have not seen the OOM issue with the patch version. Due to our monthly deployment cycle, we only started using the patch version last week. Will update you if we hit the same issue again.
Thanks for the update @Ayush_Agrahari, this is very encouraging!
Please try to remember to provide another update in a week or so, if possible.
Also, you should be aware that this fix was released with version 1.20.0, so you can upgrade to it in your next update cycle.
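If you pull the agent from Maven Central, the upgrade would look roughly like this - a sketch, assuming the standard co.elastic.apm:elastic-apm-agent coordinates:

    curl -O https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/1.20.0/elastic-apm-agent-1.20.0.jar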