APM Consuming lot of HEAP Memory

Ayush_Agrahari · December 24, 2020, 9:17am

Hello ,

We have deployed one java service into kubernetes with Heap Size around 4GB. We also run APM along with it using below command
.

java -javaagent:/home/user/elastic-apm-agent-1.18.0.RC1.jar -jar -Delastic.apm.service_name=A -Delastic.apm.server_urls=http://192.168.x.x:8200 -Delastic.apm.environment=Integration -Delastic.apm.profiling_inferred_spans_enabled=true -Delastic.apm.enable_log_correlation=true -Delastic.apm.profiling_inferred_spans_excluded_classes=co.elastic.* /home/user/A.war

We started getting pod out Of Memory Errors in every 2-3 Days. To debug We took heap dump and found co.elastic.apm.agent.profiler.CallTree [2 Million objects] Object itself was consuming more than 2.5GB Heap Memory Space and eventually pod was going OutOfMeory.

co.elastic.apm.agent.profiler.CallTree [2 Million objects]

Could you please help me on this. Let me know if need any information.

Eyal_Koren · December 24, 2020, 1:37pm

Thanks for reporting!
Yes, some more info may help:

Please take a look at the heap graphs provided in our APM metrics (JVMs) view in the entire time leading to the OOM error - does it seem like a leak, meaning - continuously increasing heap usage until depleted? Or is it a spike in heap usage shortly before the crash?
If possible, please provide such heap for analysis
Provide as many details as possible on these pods - exact OS, exact JVM version etc.
If there are any other configurations you make through env variables or otherwise, please provide them as well.

Until we get to the bottom of this, you can run your app without setting the -Delastic.apm.profiling_inferred_spans_enabled config to avoid these crashes.

Also, please upgrade to the latest agent version so we know we are not looking into something that is already irrelevant.

Thanks!

Ayush_Agrahari · December 25, 2020, 4:15am

Thanks @Eyal_Koren for replying and providing inputs.

Please find information as you asked.

It's memory leak. Continously increasing heap usage until depleted.

It would not be possible to share heap for analysis. I am sharing screenshot of analysis. Please let me know what steps need to be followed in analysis. Would do and provide you info.

I am sharing JVM and OS Version information below.

No Other configuration. I am sharing config file of APM Server below.

apm-server:
  # Defines the host and port the server is listening on. Use "unix:/path/to.sock" to listen on a unix domain socket.
  host: "0.0.0.0:8200"
  rum.enabled: true
  kibana:
    enabled: true
    host: "https://${KIBANA_HOST}:${KIBANA_PORT}"
    #protocol: "https"
    username: ${ELASTIC_USERNAME}
    password: ${ELASTIC_PASSWORD}
    ssl.enabled: true
    ssl.verification_mode: none
    ssl.certificate_authorities: [ "/certs/ca.crt" ]

monitoring:
  enabled: true

output.elasticsearch:
  hosts: [ "https://${ELASTICSEARCH_HOST_PORT}" ]
  username: ${ELASTIC_USERNAME}
  password: ${ELASTIC_PASSWORD}
  ssl.enabled: true
  ssl.certificate_authorities: [ "/certs/ca.crt" ]

output.elasticsearch.bulk_max_size: 1500
output.elasticsearch.worker: 3
queue.mem.events: 4500

We are now upgrading to latest version and have disabled -Delastic.apm.profiling_inferred_spans_enabled for now.

Please let me know if you need any more information.

Thanks,
Ayush A.

Eyal_Koren · December 27, 2020, 7:22am

Definitely looks like a memory leak
I will look into that. In the mean time, can you analyze heap references and see what holds all these CallTree objects (which collection in which class)?
Thanks!

Eyal_Koren · December 27, 2020, 11:06am

It looks like it took 3-4 hours before the leak took effect.
So a couple more questions for analysis:

Did you dynamically change any configuration?
Can you find something interesting in logs (agent or app) somewhere between 10PM of the 21st and 2AM of the 22nd?
Can you think of anything else that changed or happened during this time?
Do you redeploy/stop/start the application running on this JVM (without restarting the entire JVM/Servlet-container/pod) from time to time?
How reproducible is this behaviour? If you run the same settings again, would it look the same?
Anything else interesting you noticed in the metrics tab? For example, did the number of threads remain stable?

Thanks.

Ayush_Agrahari · December 28, 2020, 3:29am

Hey @Eyal_Koren,

Sharing more stats related on Heap consumption by CallTree Objects.

Thread Stack.

elastic-apm-sampling-profiler
  at jdk.internal.misc.Unsafe.park(ZJ)V (Native Method)
  at java.util.concurrent.locks.LockSupport.parkNanos(J)V (LockSupport.java:357)
  at co.elastic.apm.agent.profiler.SamplingProfiler.consumeActivationEventsFromRingBufferAndWriteToFile(Lco/elastic/apm/agent/configuration/converter/TimeDuration;)V (SamplingProfiler.java:395)
  at co.elastic.apm.agent.profiler.SamplingProfiler.profile(Lco/elastic/apm/agent/configuration/converter/TimeDuration;Lco/elastic/apm/agent/configuration/converter/TimeDuration;)V (SamplingProfiler.java:346)
  at co.elastic.apm.agent.profiler.SamplingProfiler.run()V (SamplingProfiler.java:317)
  at java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object; (Executors.java:515)
  at java.util.concurrent.FutureTask.run()V (FutureTask.java:264)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V (ScheduledThreadPoolExecutor.java:304)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1128)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:628)
  at java.lang.Thread.run()V (Thread.java:834)

Screenshot from 2020-12-28 09-11-01

Eyal_Koren · December 28, 2020, 6:36am

@Ayush_Agrahari thanks for the additional info.
Any chance you answer my questions as well?

One of them was how reproducible this is. I ran some long load tests on an Oracle JDK and was not able to reproduce it so far. If it is easily reproducible in your Ubuntu setting, we have a good place to test it. Can you please try this snapshot and see if it resolves the problem? I did some initial analysis of the code and made some small changes in Object recycling.

Ayush_Agrahari · December 28, 2020, 6:43am

Thanks @Eyal_Koren . Please find answers on asked questions.

No @Eyal_Koren

Nothing interesting I found.

No, As this error we are getting repeatedly in every 3-4 Days.

We redeploy pod it means old pod would be destroyed and new Pod JVM would be deployed.

We get this issue in every 2 to 3 Days. For now As per your suggestion We disabled profiling hence not getting that issue. We can reproduce it in our system. It takes 2-3 Days.

Nothing as such. During last 1-3 hour before OOM Thread count got increased by 5-10%.

Eyal_Koren · December 28, 2020, 6:55am

This would be extremely useful! Please reenable the sampling profiler on one (or subset) of your pods but use the agent snapshot provided above. Set the JVM args to provide the most detailed heap dump if and when it crashes.

If this agent snapshot fixes the issue - great. If not, please try again to analyse your heap dump as you did, to figure out exactly which field/collection of the SamplingProfiler instance is holding these huge CallTree$Root objects (assuming you cannot provide the heap dump itself).

Thanks!

Ayush_Agrahari · December 28, 2020, 7:25am

Thanks @Eyal_Koren for quick help and sharing JAR link.

I will update APM Agent and would observe JVM for OOM Error. If we get again OOM then would share stats with you.

Thanks Again,

Eyal_Koren · December 28, 2020, 7:27am

If not reproduced, please update as well

Ayush_Agrahari · December 28, 2020, 7:35am

Sure @Eyal_Koren

Eyal_Koren · January 3, 2021, 7:23am

Hey @Ayush_Agrahari, did you get the chance to test the fix?

Eyal_Koren · January 6, 2021, 7:18am

@Ayush_Agrahari one last ping - we are about to merge the related PR that contains the proposed fix, so it would be nice to get your feedback on it.
Thanks

Ayush_Agrahari · January 12, 2021, 6:31am

@Eyal_Koren Apology for late reply. We did not get OOM issue with patch version. We started using patch version from last week due to monthly deployment cycle. Will udpate you once we get same issue.

Thanks

Eyal_Koren · January 12, 2021, 7:39am

Thanks for the update @Ayush_Agrahari, this is very encouraging!
Please try to remember to provide another update in a week or so, if possible.
Also, you should be aware that this fix was released with version 1.20.0, so you can upgrade to it in your next update cycle.

Ayush_Agrahari · January 28, 2021, 7:29am

Hi @Eyal_Koren ,

It's been days and we did not find heap issue so far. Thanks for your help

system · February 18, 2021, 3:29am

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
APM Consuming lot of reatined Heap and Blocked APM Threads in Thread Dump APM java	7	997	June 27, 2021
The java application server in the pod was killed due to excessive memory usage APM java	7	513	May 8, 2024
90.5/7 OOM errors-- memory leak or GC problems? Elasticsearch	5	419	July 6, 2017
Memory problems Elasticsearch	27	1331	July 6, 2017
APM client drop queue if memory usage too big APM	8	1299	November 4, 2022

APM Consuming lot of HEAP Memory

Related topics