APM Consuming lot of HEAP Memory

Hello ,

We have deployed one java service into kubernetes with Heap Size around 4GB. We also run APM along with it using below command
.

java -javaagent:/home/user/elastic-apm-agent-1.18.0.RC1.jar -jar -Delastic.apm.service_name=A -Delastic.apm.server_urls=http://192.168.x.x:8200 -Delastic.apm.environment=Integration -Delastic.apm.profiling_inferred_spans_enabled=true -Delastic.apm.enable_log_correlation=true -Delastic.apm.profiling_inferred_spans_excluded_classes=co.elastic.* /home/user/A.war

We started getting pod out Of Memory Errors in every 2-3 Days. To debug We took heap dump and found co.elastic.apm.agent.profiler.CallTree [2 Million objects] Object itself was consuming more than 2.5GB Heap Memory Space and eventually pod was going OutOfMeory.

co.elastic.apm.agent.profiler.CallTree [2 Million objects]

Could you please help me on this. Let me know if need any information.

Thanks for reporting!
Yes, some more info may help:

  1. Please take a look at the heap graphs provided in our APM metrics (JVMs) view in the entire time leading to the OOM error - does it seem like a leak, meaning - continuously increasing heap usage until depleted? Or is it a spike in heap usage shortly before the crash?
  2. If possible, please provide such heap for analysis
  3. Provide as many details as possible on these pods - exact OS, exact JVM version etc.
  4. If there are any other configurations you make through env variables or otherwise, please provide them as well.

Until we get to the bottom of this, you can run your app without setting the -Delastic.apm.profiling_inferred_spans_enabled config to avoid these crashes.

Also, please upgrade to the latest agent version so we know we are not looking into something that is already irrelevant.

Thanks!

Thanks @Eyal_Koren for replying and providing inputs.

Please find information as you asked.

It's memory leak. Continously increasing heap usage until depleted.

It would not be possible to share heap for analysis. I am sharing screenshot of analysis. Please let me know what steps need to be followed in analysis. Would do and provide you info.

I am sharing JVM and OS Version information below.

No Other configuration. I am sharing config file of APM Server below.

apm-server:
  # Defines the host and port the server is listening on. Use "unix:/path/to.sock" to listen on a unix domain socket.
  host: "0.0.0.0:8200"
  rum.enabled: true
  kibana:
    enabled: true
    host: "https://${KIBANA_HOST}:${KIBANA_PORT}"
    #protocol: "https"
    username: ${ELASTIC_USERNAME}
    password: ${ELASTIC_PASSWORD}
    ssl.enabled: true
    ssl.verification_mode: none
    ssl.certificate_authorities: [ "/certs/ca.crt" ]

monitoring:
  enabled: true

output.elasticsearch:
  hosts: [ "https://${ELASTICSEARCH_HOST_PORT}" ]
  username: ${ELASTIC_USERNAME}
  password: ${ELASTIC_PASSWORD}
  ssl.enabled: true
  ssl.certificate_authorities: [ "/certs/ca.crt" ]

output.elasticsearch.bulk_max_size: 1500
output.elasticsearch.worker: 3
queue.mem.events: 4500

We are now upgrading to latest version and have disabled -Delastic.apm.profiling_inferred_spans_enabled for now.

Please let me know if you need any more information.

Thanks,
Ayush A.

Definitely looks like a memory leak :frowning:
I will look into that. In the mean time, can you analyze heap references and see what holds all these CallTree objects (which collection in which class)?
Thanks!

It looks like it took 3-4 hours before the leak took effect.
So a couple more questions for analysis:

  1. Did you dynamically change any configuration?
  2. Can you find something interesting in logs (agent or app) somewhere between 10PM of the 21st and 2AM of the 22nd?
  3. Can you think of anything else that changed or happened during this time?
  4. Do you redeploy/stop/start the application running on this JVM (without restarting the entire JVM/Servlet-container/pod) from time to time?
  5. How reproducible is this behaviour? If you run the same settings again, would it look the same?
  6. Anything else interesting you noticed in the metrics tab? For example, did the number of threads remain stable?

Thanks.

Hey @Eyal_Koren,

Sharing more stats related on Heap consumption by CallTree Objects.


Thread Stack.

elastic-apm-sampling-profiler
  at jdk.internal.misc.Unsafe.park(ZJ)V (Native Method)
  at java.util.concurrent.locks.LockSupport.parkNanos(J)V (LockSupport.java:357)
  at co.elastic.apm.agent.profiler.SamplingProfiler.consumeActivationEventsFromRingBufferAndWriteToFile(Lco/elastic/apm/agent/configuration/converter/TimeDuration;)V (SamplingProfiler.java:395)
  at co.elastic.apm.agent.profiler.SamplingProfiler.profile(Lco/elastic/apm/agent/configuration/converter/TimeDuration;Lco/elastic/apm/agent/configuration/converter/TimeDuration;)V (SamplingProfiler.java:346)
  at co.elastic.apm.agent.profiler.SamplingProfiler.run()V (SamplingProfiler.java:317)
  at java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object; (Executors.java:515)
  at java.util.concurrent.FutureTask.run()V (FutureTask.java:264)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V (ScheduledThreadPoolExecutor.java:304)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1128)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:628)
  at java.lang.Thread.run()V (Thread.java:834)

Screenshot from 2020-12-28 09-11-01

@Ayush_Agrahari thanks for the additional info.
Any chance you answer my questions as well? :slight_smile:

One of them was how reproducible this is. I ran some long load tests on an Oracle JDK and was not able to reproduce it so far. If it is easily reproducible in your Ubuntu setting, we have a good place to test it. Can you please try this snapshot and see if it resolves the problem? I did some initial analysis of the code and made some small changes in Object recycling.

Thanks @Eyal_Koren . Please find answers on asked questions.

No @Eyal_Koren

Nothing interesting I found.

No, As this error we are getting repeatedly in every 3-4 Days.

We redeploy pod it means old pod would be destroyed and new Pod JVM would be deployed.

We get this issue in every 2 to 3 Days. For now As per your suggestion We disabled profiling hence not getting that issue. We can reproduce it in our system. It takes 2-3 Days.

Nothing as such. During last 1-3 hour before OOM Thread count got increased by 5-10%.

This would be extremely useful! Please reenable the sampling profiler on one (or subset) of your pods but use the agent snapshot provided above. Set the JVM args to provide the most detailed heap dump if and when it crashes.

If this agent snapshot fixes the issue - great. If not, please try again to analyse your heap dump as you did, to figure out exactly which field/collection of the SamplingProfiler instance is holding these huge CallTree$Root objects (assuming you cannot provide the heap dump itself).

Thanks!

Thanks @Eyal_Koren for quick help and sharing JAR link.

I will update APM Agent and would observe JVM for OOM Error. If we get again OOM then would share stats with you.

Thanks Again,

1 Like

If not reproduced, please update as well :slight_smile:

Sure @Eyal_Koren

Hey @Ayush_Agrahari, did you get the chance to test the fix?

@Ayush_Agrahari one last ping - we are about to merge the related PR that contains the proposed fix, so it would be nice to get your feedback on it.
Thanks :pray:

@Eyal_Koren Apology for late reply. We did not get OOM issue with patch version. We started using patch version from last week due to monthly deployment cycle. Will udpate you once we get same issue.

Thanks :pray:

Thanks for the update @Ayush_Agrahari, this is very encouraging!
Please try to remember to provide another update in a week or so, if possible.
Also, you should be aware that this fix was released with version 1.20.0, so you can upgrade to it in your next update cycle.

1 Like