Firstly, I'm not sure how I'd replicate this problem, so apologies in advance...
Looking into memory usage on an app at the moment because it seems to be using a bunch more memory than it should be...
Drilling in with the profiling tool, the main usage seems to be coming from a ConcurrentHashMap in co.elastic.apm.agent.jdbc.helper.JdbcHelper.
Inside that class is a static list maintaining all the SQL ever run, by the looks of it...
This list will grow and grow and none of the referenced objects can ever be garbage collected (referenced by a static object), right?
If there were a TTL or something on the list, or something that removed queries at some stage, I think this would help the memory footprint a lot. Our current co.elastic.apm.agent.jdbc.helper.JdbcHelper retains nearly 8GB of memory...
APM Agent language and version:
java - 1.8.0
$ java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
A short explanation: the JdbcHelper holds a WeakConcurrentMap, not a ConcurrentHashMap. The difference is that the former uses weakly-referenced keys, which means it should not prevent Statements from being garbage collected. We don't actively remove entries from this map, as we rely on the map to remove them once (or shortly after) the Statement objects (used as keys) are GCed.
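To make the weak-key distinction concrete, here is a minimal sketch using java.util.WeakHashMap purely as a stand-in for the agent's WeakConcurrentMap (it is not the agent's actual code): once nothing else strongly references a key, its entry becomes eligible for removal, so the map should not pin the SQL values forever.

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakKeyDemo {
    public static void main(String[] args) throws InterruptedException {
        // WeakHashMap stands in for WeakConcurrentMap only to illustrate weak-key semantics.
        Map<Object, String> sqlByStatement = new WeakHashMap<>();

        Object statement = new Object();               // stand-in for a JDBC Statement
        sqlByStatement.put(statement, "SELECT 1");     // hypothetical SQL value
        System.out.println("before GC: " + sqlByStatement.size());   // 1

        statement = null;   // drop the last strong reference to the key
        System.gc();        // a hint only, but usually enough for a demo
        Thread.sleep(100);

        // With the key collected, the entry no longer pins the SQL string.
        System.out.println("after GC: " + sqlByStatement.size());    // typically 0
    }
}
```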
In order to explore what happens there further, it will be useful to get some more info:
Just to make sure: does this mean it behaves differently with and without the agent? Did you see a difference between a former agent version and 1.8.0?
There are generally two ways to explain these symptoms: either there is a bug in the map implementation or in the way we use it, or something other than the map is causing this statement leak and you only see it in the map because the statements do not get GCed.
Using your heap-analysis tool, please try to look at map entries for statements that shouldn't be there (meaning they should have been cleared already) and check:
The key should be of type com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap.WeakKey, which extends WeakReference. Is the referent field of the key null?
If referent is not null, try to see what else is referencing it. That would be the most useful thing we could have for further analysis.
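For reference, this tiny sketch shows what a cleared vs. live referent means in terms of a plain java.lang.ref.WeakReference (the map's WeakKey extends it); the class and object names here are made up for illustration.

```java
import java.lang.ref.WeakReference;

public class ReferentCheckDemo {
    public static void main(String[] args) throws InterruptedException {
        Object statement = new Object();                    // stand-in for a Statement
        WeakReference<Object> key = new WeakReference<>(statement);

        System.out.println("live: " + key.get());           // non-null: still strongly reachable

        statement = null;   // drop the last strong reference
        System.gc();
        Thread.sleep(100);

        // In a heap dump: referent == null means the Statement was already collected and
        // the entry is just waiting to be expunged; a non-null referent means something
        // is still holding the Statement strongly.
        System.out.println("after GC: " + key.get());        // typically null
    }
}
```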
I'll do some digging, thanks for the information, Eyal. If I don't get to it tomorrow, I'll have some services that have run a few days without elastic_apm JDBC enabled, and that may shed some light if they show similar footprints (i.e. it being an 'us' problem, which I'm rather hoping for).
We did make a couple of changes at the same time, so I'm hoping I'm just blaming the wrong thing, and your great description may help narrow it down, so I really appreciate that.
If I don't reply tomorrow, I'll be replying on the flip side in the new working week!
I think you are onto something here. Could you look into your logs (probably catalina.out) for things like Exception in thread and weak-ref-cleaner-2. Hopefully that reveals why the thread died. My suspicion is that an exception is thrown here: WeakConcurrentMap.java#L254 which terminates the thread because it's not caught.
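As background, this is roughly the cleaner-thread pattern such weak-keyed maps use; it is a hypothetical sketch, not the actual WeakConcurrentMap source. The key point is that anything thrown out of the loop other than InterruptedException ends the thread, after which stale entries are never expunged and the map only grows.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

// Hypothetical sketch of a weak-ref cleaner thread: it blocks on a ReferenceQueue
// and expunges map entries whose keys have been garbage collected.
public class CleanerThreadSketch {

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    void startCleaner() {
        Thread cleaner = new Thread(() -> {
            try {
                while (!Thread.interrupted()) {
                    Reference<? extends Object> stale = queue.remove(); // blocks until a key is GCed
                    expunge(stale); // an uncaught RuntimeException here escapes run()
                                    // and silently kills the cleaner thread
                }
            } catch (InterruptedException e) {
                // normal shutdown path
            }
        }, "weak-ref-cleaner-2");
        cleaner.setDaemon(true);
        cleaner.start();
    }

    private void expunge(Reference<?> stale) {
        // map bookkeeping for the stale entry would happen here
    }
}
```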
Ok, I've gone diving through the logs, but there's not a lot to go on.
Every now and then I get a message like:
SEVERE: The web application [/v4##k00j132] appears to have started a thread named [weak-ref-cleaner-2] but has failed to stop it. This is very likely to create a memory leak.
Aug 14, 2019 12:30:02 AM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [/v4##k00j132] appears to have started a thread named [weak-ref-cleaner-3] but has failed to stop it. This is very likely to create a memory leak.
Aug 14, 2019 12:30:02 AM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesStopTimerThread
The time was suspicious, so I dug into our build system, and we clean up old applications daily.
When the old application gets undeployed, a bunch of threads get killed, and it looks like the weak-ref-cleaner thread in that context is one of the victims.
The Tomcat instance doesn't currently get restarted when there are apps to undeploy; maybe this memory is hanging around because it's referenced from static memory of an old application instance that is no longer running?
It sounds like, as a workaround, I may need to fully restart Tomcat to clean up any lingering memory when I undeploy, which has obvious issues in production.
Another question: how do you start the agent? With -javaagent or with ElasticApmAttacher.attach()? I assume the latter? Where in your application do you call it?
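For context, the self-attach option typically looks like the hedged sketch below (it requires the apm-agent-attach artifact on the classpath; the class name and call site here are hypothetical), whereas the -javaagent alternative is set on the JVM command line instead. Presumably the question matters because where attach() is called from can determine which deployment's lifecycle the agent's helper threads end up associated with.

```java
import co.elastic.apm.attach.ElasticApmAttacher;

public class AppBootstrap {
    public static void main(String[] args) {
        // Programmatic self-attach; usually called as early as possible,
        // e.g. at the top of main() or from a webapp initializer.
        ElasticApmAttacher.attach();

        // ... start the rest of the application ...
    }
}
```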
I'll download the snapshot, it'll be hard to simulate the problem but I might be able to get a beta environment to play with without annoying too many people!
I no longer got the SEVERE message that the thread was dead, and I can't see any huge objects, but the application hasn't been running for a long time. It's fixed with the snapshot as near as I can tell...
@Paul_Harris1 seems like there is no need to run the application for a long time, only to undeploy/redeploy it and see if you get the SEVERE message and whether the cleaner thread dies. However, this will only be reproduced if the first application that used JDBC is the one being undeployed, so it's probably better to try with a single app...
I think that the JVM needs to be run with a special flag allowing Tomcat to stop such threads as well, but I'm not sure about that.
Does this mean you see a different behaviour comparing 1.8.0 to the 1.8.1-SNAPSHOT on the same environment?
Strange, I looked in all three branches - 7.0.x, 8.5.x and master (where 9.0.x releases are tagged from) and this fix should be valid for all.
@Paul_Harris1 I just need confirmation that using the provided snapshot build produces the same results, and if so, in which Tomcat version, under what scenario, and with what exact log message.
Thanks a lot, your feedback was extremely valuable so far!!
SEVERE: The web application [/v4##k000342] appears to have started a thread named [weak-ref-cleaner-3] but has failed to stop it. This is very likely to create a memory leak.
Aug 20, 2019 3:38:07 PM org.apache.catalina.loader.WebappClassLoaderBase loadClass
Ahh, I think I know what is going on here...
I believe we did solve the original leak, but we use this map elsewhere. I should have realized that from the thread's name (ending with a number larger than 1).
I double checked the exact tomcat version (7.0.94) and I believe this fix should be valid.
Preparing and building a quick fix and uploading for test.