Elastic APM agent JdbcHelper seems to use a lot of memory

Firstly, I'm not sure how to replicate this problem, apologies in advance...

Looking into memory usage on an app at the moment because it seems to be using a bunch more memory than it should be...

Drilling in with the profiling tool, the main usage seems to be coming from a ConcurrentHashMap in co.elastic.apm.agent.jdbc.helper.JdbcHelper.

Inside that class is a static list maintaining all the SQL ever run, by the looks of it...

This list will grow and grow, and none of the referenced objects can ever be garbage collected (they're referenced by a static object), right?

If there were a TTL on the list, or something that removed queries at some stage, I think this would help the memory footprint a lot. Our current co.elastic.apm.agent.jdbc.helper.JdbcHelper retains nearly 8GB of memory...

APM Agent language and version:
Java, agent version 1.8.0

$ java -version

java version "1.8.0_161"

Java(TM) SE Runtime Environment (build 1.8.0_161-b12)

Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

Hi Paul and welcome to the APM forum :slight_smile:

The problem you report is troubling...

A short explanation: the JdbcHelper holds a WeakConcurrentMap, not a ConcurrentHashMap. The difference is that the former uses weakly-referenced keys, which means it should not prevent Statements from being garbage collected. We don't actively remove entries from this map, as we rely on the map to remove them once (or shortly after) the Statement objects (used as keys) are GCed.
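
To illustrate the idea of weakly-referenced keys, here is a minimal sketch using plain JDK weak references. The class and method names are illustrative only, not the agent's actual implementation: once nothing else strongly references a key, the key can be collected and its entry expunged.

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a map with weakly-referenced keys, similar in spirit to the
// WeakConcurrentMap the agent uses. Illustrative only, not the agent's code.
public class WeakKeySketch<K, V> {

    private final Map<WeakReference<K>, V> map = new ConcurrentHashMap<>();
    private final ReferenceQueue<K> queue = new ReferenceQueue<>();

    public void put(K key, V value) {
        // The map only holds the key weakly, so it does not by itself prevent
        // the key (e.g. a Statement) from being garbage collected.
        map.put(new WeakReference<>(key, queue), value);
    }

    // Once a key has been collected, its WeakReference shows up on the queue
    // and the stale entry can be removed. In the real implementation this is
    // done continuously by a dedicated "weak-ref-cleaner" thread.
    public void expungeStaleEntries() {
        Reference<? extends K> stale;
        while ((stale = queue.poll()) != null) {
            map.remove(stale);
        }
    }
}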

In order to explore what happens there further, it will be useful to get some more info:

Just to make sure: does this mean it behaves differently with and without the agent? Did you see a difference between a former agent version and 1.8.0?

There are generally two ways to explain these symptoms: either there is a bug in the map implementation or in the way we use it, or something other than the map is causing this statement leak, and you see it in the map because the statements do not get GCed.

Using your heap-analysis tool, please try to look at map entries for statements that shouldn't be there (meaning, they should have been cleared already) and check:

  1. The key should be of type com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap.WeakKey, which extends WeakReference. Is the referent field of the key null?
  2. If the referent is not null, try to see what else is referencing it. That would be the most useful thing we can have for further analysis.

I'll do some digging, thanks for the information, Eyal. If I don't get to it tomorrow, I'll have some services that have run a few days without the elastic_apm jdbc instrumentation enabled, and that may shed some light if they show similar footprints (i.e. it being an 'us' problem, which I'm rather hoping for).
We did make a couple of changes at the same time, so I'm hoping I'm just blaming the wrong thing; your great description may help narrow it down, so I really appreciate it.

If I don't reply tomorrow, I'll be replying on the flip side in the new working week!

Paul.

Sure, thanks! Looking forward to hearing back.

I can't see any other references :confused:



I don't do this kind of digging a huge amount, so I've included several screenshots (rather than wait on forum tag :))

Hopefully you can see something from these?

One 'interesting' (to me at least) thing is that there's a 'weak-ref-cleaner-2' thread in the second screenie listed as terminated.


I think you are onto something here. Could you look through your logs (probably catalina.out) for things like "Exception in thread" and "weak-ref-cleaner-2"? Hopefully that reveals why the thread died. My suspicion is that an exception is thrown here: WeakConcurrentMap.java#L254, which terminates the thread because it's not caught.
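
To illustrate why a single uncaught exception would be fatal here: the cleaner thread is essentially a loop over a reference queue, roughly along these lines (a sketch of the general pattern, not the library's exact code):

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

// Rough sketch of a reference-queue cleaner thread, similar in spirit to the
// "weak-ref-cleaner" threads started by WeakConcurrentMap. Not the library's
// exact code.
class CleanerSketch implements Runnable {

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    @Override
    public void run() {
        try {
            while (!Thread.interrupted()) {
                // remove() blocks until a collected key is enqueued, then the
                // corresponding stale map entry is expunged. If anything in
                // this loop throws an unchecked exception, run() exits, the
                // thread dies, and stale entries (and the SQL strings they
                // reference) are never cleaned up again.
                Reference<?> stale = queue.remove();
                expunge(stale);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void expunge(Reference<?> stale) {
        // remove the map entry keyed by this reference
    }
}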

It certainly looks like the cleaner thread's death is the root cause: it seems that the Statement objects are garbage collected, but the SQL strings are leaking.

Ok, I've gone diving through the logs, but there's not a lot to go on.

Every now and then I get a message like:

SEVERE: The web application [/v4##k00j132] appears to have started a thread named [weak-ref-cleaner-2] but has failed to stop it. This is very likely to create a memory leak.
Aug 14, 2019 12:30:02 AM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [/v4##k00j132] appears to have started a thread named [weak-ref-cleaner-3] but has failed to stop it. This is very likely to create a memory leak.
Aug 14, 2019 12:30:02 AM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesStopTimerThread

The time was suspicious, so I dug into our build system, and we clean up old applications daily.

When the old application gets undeployed, a bunch of threads get killed, and it looks like the weak-ref-cleaner thread in that context is one of the victims.

The Tomcat instance doesn't currently get restarted when there are apps to undeploy; maybe this memory is hanging around because it's referenced from static memory in an old application instance that is no longer running?

https://stackoverflow.com/questions/4565105/is-there-a-way-to-avoid-undeployment-memory-leaks-in-tomcat seems pertinent to the whole scenario...

also this post: https://blog.georgovassilis.com/2014/01/15/tomcat-spring-and-memory-leaks-when-undeploying-or-redeploying-an-web-application/

It sounds like, as a workaround, I may need to fully restart Tomcat to clean up any lingering memory when I undeploy, which has obvious issues in production :frowning_face:

Hmm, seems like Tomcat's memory leak prevention mechanism is causing a memory leak here :slight_smile:

Thanks for the great feedback, this is extremely valuable info!! Looking into ways of bypassing that.
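
For context, one commonly suggested mitigation for library housekeeping threads that get flagged (and killed) by Tomcat's leak prevention is to start them as daemons without the webapp's context class loader, roughly like this. This is a sketch of the general pattern, not necessarily what the fix below does:

final class DetachedThreads {

    // Sketch of starting a housekeeping thread detached from the web application,
    // so that Tomcat's WebappClassLoaderBase is less likely to associate it with
    // the webapp. Illustrative only, not necessarily the agent's actual fix.
    static Thread startDetachedDaemon(Runnable task, String name) {
        Thread thread = new Thread(task, name);
        thread.setDaemon(true);
        // Avoid inheriting the webapp's context class loader from the creating thread.
        thread.setContextClassLoader(null);
        thread.start();
        return thread;
    }
}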

@Paul_Harris1 I created a PR for handling this specific Tomcat issue.
Can you please download and test this snapshot build and see if it fixes the problem?

Another question: how do you start the agent? With -javaagent or with ElasticApmAttacher.attach()? I assume the latter? Where in your application do you call it?
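
For reference, the two options look roughly like this (paths and class names are placeholders; the self-attach variant uses the separate apm-agent-attach artifact):

// Option 1: on the JVM command line (path is a placeholder):
//   java -javaagent:/path/to/elastic-apm-agent.jar -jar my-app.jar
//
// Option 2: programmatic self-attach, called as early as possible during startup:
import co.elastic.apm.attach.ElasticApmAttacher;

public class Bootstrap {
    public static void main(String[] args) {
        ElasticApmAttacher.attach(); // attaches the agent to the current JVM
        // ... start the application ...
    }
}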

with -javaagent on the command line...

I'll download the snapshot. It'll be hard to simulate the problem, but I might be able to get a beta environment to play with without annoying too many people!

I no longer got the SEVERE message that the thread was dead, and I can't see any huge objects, but the application hasn't been running for long. As near as I can tell, it's fixed with the snapshot...

@Paul_Harris1 it seems like there is no need to run the application for a long time, only to undeploy/redeploy it and see whether you get the SEVERE message and whether the cleaner thread dies. However, this will only be reproduced if the first application that used JDBC is the one being undeployed, so it's probably better to try with one app...
I think the JVM needs to be run with a special flag allowing Tomcat to stop such threads as well, but I'm not sure about that.

Does this mean you see different behaviour when comparing 1.8.0 to the 1.8.1-SNAPSHOT in the same environment?

Ok, so I was wrong :confused:. I realised I was deploying but not undeploying old apps, so the old instances were still present.

On an undeploy, I do still get the SEVERE 'thread named weak-ref-cleaner' message...

I'm assuming that means that we'd ultimately get the memory leaks...

Looks like it...
Please specify the exact Tomcat version you are using.
And add the exact related log message.

Strange; I looked at all three branches (7.0.x, 8.5.x, and master, where 9.0.x releases are tagged from) and this fix should be valid for all of them.

@Paul_Harris1 I just need confirmation that using the provided snapshot build produces the same results, and if so, with which Tomcat version, in what scenario, and what the exact log message is.

Thanks a lot, your feedback was extremely valuable so far!!

catalina.out:

SEVERE: The web application [/v4##k000342] appears to have started a thread named [weak-ref-cleaner-3] but has failed to stop it. This is very likely to create a memory leak.
Aug 20, 2019 3:38:07 PM org.apache.catalina.loader.WebappClassLoaderBase loadClass

tomcat packages installed

[root@ip-10-0-4-251 logs]# yum list installed |grep tomcat
tomcat-native.x86_64                  1.2.21-1.22.amzn1            @amzn-updates
tomcat7.noarch                        7.0.94-1.35.amzn1            @amzn-updates
tomcat7-admin-webapps.noarch          7.0.94-1.35.amzn1            @amzn-updates
tomcat7-el-2.2-api.noarch             7.0.94-1.35.amzn1            @amzn-updates
tomcat7-jsp-2.2-api.noarch            7.0.94-1.35.amzn1            @amzn-updates
tomcat7-lib.noarch                    7.0.94-1.35.amzn1            @amzn-updates
tomcat7-servlet-3.0-api.noarch        7.0.94-1.35.amzn1            @amzn-updates
[root@ip-10-0-4-251 logs]# uname -a

Linux ip-10-0-4-251 4.14.133-88.112.amzn1.x86_64 #1 SMP Tue Jul 30 21:21:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

tomcat7.conf elastic-apm configuration and java args

ELASTIC_APM_SERVICE_NAME=v4-beta
ELASTIC_APM_APPLICATION_PACKAGES=---snip---
ELASTIC_APM_SERVER_URLS=---snip---
ELASTIC_APM_TRACE_METHODS=com.pds.*,com.mapshed.*
# ELASTIC_APM_DISABLE_INSTRUMENTATIONS=jdbc
ELASTIC_APM_AGENT=/usr/share/tomcat7/elasticapm/elastic-apm-agent-1.8.1-SNAPSHOT.jar

JAVA_OPTS="-server -Xms2673m -Xmx2673m -Xss256k -XX:NewRatio=4 -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=15 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+ScavengeBeforeFullGC -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:CMSInitiatingOccupancyFraction=30 -XX:+CMSClassUnloadingEnabled -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.password.file=/usr/share/tomcat7/conf/jmxremote.password -Dcom.sun.management.jmxremote.access.file=/usr/share/tomcat7/conf/jmxremote.access -Denvironment=beta -javaagent:${ELASTIC_APM_AGENT} -Dpds.hazelcast.deployment.type=beta "

Just making sure the Java version is the same on this machine...

[root@ip-10-0-4-251 logs]# java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

md5sum of the snapshot used

[root@ip-10-0-4-251 logs]# md5sum /usr/share/tomcat7/elasticapm/elastic-apm-agent-1.8.1-SNAPSHOT.jar

2410859c508beaa2e97fdf9e293fe45e /usr/share/tomcat7/elasticapm/elastic-apm-agent-1.8.1-SNAPSHOT.jar

[root@ip-10-0-4-251 logs]#

Sorry, I was just on my way out the door yesterday :confused: lost another day.

Ahh, I think I know what is going on here...
I believe we did solve the original leak, but we use this map elsewhere. I should have realized that from the thread's name (ending with a number larger than 1).
I double-checked the exact Tomcat version (7.0.94) and I believe the fix should be valid.
Preparing and building a quick fix and uploading it for testing.

Well, the fix was quick, but the build took time to go through...
Here, please try this snapshot and let me know what happens.
Thanks a lot!!