We started getting pod Out Of Memory errors every 2-3 days. To debug, we took a heap dump and found that co.elastic.apm.agent.profiler.CallTree (about 2 million objects) was consuming more than 2.5 GB of heap space, eventually causing the pod to go OutOfMemory.
Could you please help me with this? Let me know if you need any more information.
Thanks for reporting!
Yes, some more info may help:
Please take a look at the heap graphs provided in our APM metrics (JVMs) view for the entire period leading up to the OOM error. Does it look like a leak, meaning continuously increasing heap usage until depleted, or is it a spike in heap usage shortly before the crash?
If possible, please provide such a heap dump for analysis.
Provide as many details as possible about these pods - exact OS, exact JVM version, etc.
If there are any other configurations you make through env variables or otherwise, please provide them as well.
Until we get to the bottom of this, you can run your app without setting the -Delastic.apm.profiling_inferred_spans_enabled config, to avoid these crashes.
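For reference, a sketch of the two usual ways to turn this option off - via the JVM system property or the equivalent environment variable (the agent jar path and app jar name below are placeholders, not from this thread):

```shell
# Option 1: set the system property explicitly to false when starting the JVM
java -javaagent:/path/to/elastic-apm-agent.jar \
     -Delastic.apm.profiling_inferred_spans_enabled=false \
     -jar my-app.jar

# Option 2: the equivalent environment variable (handy in a Kubernetes pod spec)
export ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED=false
```

Simply omitting the property has the same effect, since inferred spans are disabled by default.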
Also, please upgrade to the latest agent version so we know we are not looking into something that is already irrelevant.
at jdk.internal.misc.Unsafe.park(ZJ)V (Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(J)V (LockSupport.java:357)
at co.elastic.apm.agent.profiler.SamplingProfiler.consumeActivationEventsFromRingBufferAndWriteToFile(Lco/elastic/apm/agent/configuration/converter/TimeDuration;)V (SamplingProfiler.java:395)
at co.elastic.apm.agent.profiler.SamplingProfiler.profile(Lco/elastic/apm/agent/configuration/converter/TimeDuration;Lco/elastic/apm/agent/configuration/converter/TimeDuration;)V (SamplingProfiler.java:346)
at co.elastic.apm.agent.profiler.SamplingProfiler.run()V (SamplingProfiler.java:317)
at java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object; (Executors.java:515)
at java.util.concurrent.FutureTask.run()V (FutureTask.java:264)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V (ScheduledThreadPoolExecutor.java:304)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:628)
at java.lang.Thread.run()V (Thread.java:834)
@Ayush_Agrahari thanks for the additional info.
Any chance you can answer my questions as well?
One of them was how reproducible this is. I ran some long load tests on an Oracle JDK and was not able to reproduce it so far. If it is easily reproducible in your Ubuntu setting, that gives us a good place to test. Can you please try this snapshot and see if it resolves the problem? I did some initial analysis of the code and made some small changes to object recycling.
This would be extremely useful! Please re-enable the sampling profiler on one (or a subset) of your pods, but use the agent snapshot provided above. Set the JVM args to produce the most detailed heap dump if and when it crashes.
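For capturing that dump automatically, the standard HotSpot flags can be used - a sketch, with an example dump path that should point to a persistent volume in Kubernetes so the file survives the pod crash:

```shell
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/dumps/oom.hprof \
     -javaagent:/path/to/elastic-apm-agent.jar \
     -jar my-app.jar
```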
If this agent snapshot fixes the issue - great. If not, please try again to analyse your heap dump as you did, to figure out exactly which field/collection of the SamplingProfiler instance is holding these huge CallTree$Root objects (assuming you cannot provide the heap dump itself).
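As a quick first check before opening the dump in a heap analyzer, a live class histogram can confirm whether the CallTree instances are still accumulating (assumes the JDK's jmap tool is available in the pod, and <pid> stands for the JVM's process id):

```shell
jmap -histo:live <pid> | grep -i calltree
```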
Thanks for the update @Ayush_Agrahari, this is very encouraging!
Please try to remember to provide another update in a week or so, if possible.
Also, you should be aware that this fix was released with version 1.20.0, so you can upgrade to it in your next update cycle.