APM Consuming a Lot of Retained Heap and Blocked APM Threads in Thread Dump

Hello,

We have deployed a Java service to Kubernetes with a heap size of around 8 GB. We run the Elastic APM Java agent alongside it, using the command below.

java -javaagent:/home/user/elastic-apm-agent-1.23.0.jar -jar -Delastic.apm.service_name=A -Delastic.apm.server_urls=http://192.168.x.x:8200 -Delastic.apm.environment=Development -Delastic.apm.profiling_inferred_spans_enabled=true -Delastic.apm.enable_log_correlation=true -Delastic.apm.profiling_inferred_spans_excluded_classes=co.elastic.* /home/user/A.war

We are using the latest APM agent version (1.23.0).

The setup runs in Kubernetes. During load testing, the service pod restarted frequently, and the root cause was OOM.

Observation: This does not look like a memory leak, as we could not find a leak pattern in the long-running pod. It looks more like some component takes up a lot of heap under load.

To debug further, we took heap dumps of the running pod and found that APM was consuming more than 40% of the retained heap. APM threads were also in the blocked state in the thread dump.

Attaching heap dump snapshots and the required details for further debugging.

113,652,168 bytes (35.77 %) of Java heap is used by 3,071 instances of java/util/concurrent/ConcurrentHashMap$Node

co/elastic/apm/agent/shaded/bytebuddy/pool/TypePool$CacheProvider$Simple at 0x7fd1b35a0

JVM & OS:

Could you please help me with this? Let me know if you need any more information.

@Eyal_Koren Could you or someone from the APM team please help here?

Hi @Ayush_Agrahari :wave:
Great analysis, very useful!

Unless you want to increase the heap size, you would need to disable type-pool caching by setting the enable_type_pool_cache config option to false. It is not a documented one, because it is very rarely changed from the default.

The way to set it would be one of the following:

  • in an agent config file: enable_type_pool_cache=false
  • as a system property on the command line: -Delastic.apm.enable_type_pool_cache=false
  • as an environment variable: ELASTIC_APM_ENABLE_TYPE_POOL_CACHE=false
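
As a rough sketch of two of these options applied to the launch command from the original post: the paths and service name are copied from that post, while the variable names (AGENT_JAR, APP_WAR, APM_OPTS, LAUNCH_CMD) are only illustrative, and the other -Delastic.apm.* options are left out for brevity. Either variant alone should be enough; both are shown only for comparison.

```shell
# Sketch only: paths and service name come from the original post;
# the variable names are illustrative, not part of the agent.
AGENT_JAR=/home/user/elastic-apm-agent-1.23.0.jar
APP_WAR=/home/user/A.war

# Variant 1: system property on the command line.
# JVM options are placed before -jar here, which is the conventional order.
APM_OPTS="-javaagent:$AGENT_JAR -Delastic.apm.service_name=A -Delastic.apm.enable_type_pool_cache=false"
LAUNCH_CMD="java $APM_OPTS -jar $APP_WAR"

# Variant 2: environment variable, e.g. exported in the container entrypoint
# or set in the pod spec before the JVM starts.
export ELASTIC_APM_ENABLE_TYPE_POOL_CACHE=false

# Printed for review only; run the command itself to start the service.
echo "$LAUNCH_CMD"
```

In Kubernetes, the environment variable variant is often the least intrusive, since it can be changed in the deployment without touching the launch command.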

It may make startup time longer, but based on your configured heap, I assume this is not a huge app, so it wouldn't necessarily be an issue.

I hope this helps.

Also note that the type cache is cleared once it has not been accessed for a minute. Usually, the type pool cache is only used on startup, so after your app has warmed up, the cache should be cleared automatically. In addition, the cache entries are referenced via SoftReferences, so they are cleared automatically if JVM heap usage approaches the limit.

As @felixbarny noted, this cache should not cause an OOM, and it should not consume any heap once the application has been up for a while.
We think we know why this wasn't behaving as expected. Please try to use this fix snapshot, without the enable_type_pool_cache config, and let us know if this resolves the issue.
Thanks!

Version 1.24.0 has been released with this fix.

Thanks @Eyal_Koren @felixbarny. I will try this and report back on the heap consumption.