Long "stop-the-world" pauses in Tomcat, monitored by apm-java-agent

We have an ELK stack with APM Server in the cloud: cloud.elastic.co (APM Server 8.9.1)
We monitor our production application servers (several Tomcats behind HAProxy) with the apm-java-agent (v1.42), and we (and unfortunately our customers too) are seeing 30-40s application stops (stop-the-world pauses). We have JVM logs like this:

We have figured out that this class redefinition is happening through instrumentation and is somehow coming from the agent (directly).

How can we figure out why this happens? Can we change some settings to fine-tune the apm-java-agent (or the JVM/Tomcat)?
For now we have disabled the apm-java-agents, but we would like to keep using the agent because it provides us with a lot of important information.

The agent source code has Maven settings for the maven-shade-plugin: https://github.com/elastic/apm-agent-java/blob/e28ad36dfb8960760314af613d624e1a00ac4cb2/elastic-apm-agent/pom.xml#L276
Can-Redefine-Classes: true, Can-Transform-Classes: true

Is that something that can be switched off?

Hi @Bela_Borbely , thanks for reporting this.

Can you capture a thread dump when this issue happens?
This should help narrow down which classes are instrumented or potential causes.

The maven-shade-plugin settings are not related to what you are experiencing here; they just give the agent the ability to redefine classes. The problem here is rather "why does the agent make the JVM spend so much time redefining classes?".

Thank you for your answer!
So the "redefining classes" operation is necessary for the agent, and we cannot just switch it off?
I don't know how to capture a thread dump at the moment the issue happens. It occurs sporadically.

We have found a quite similar situation described by Okta Developers:

Yes, the "redefining classes" operation is necessary for the agent; it's not something we can switch off.

In order to capture thread dumps, you can use jcmd or send the SIGQUIT signal; see the docs for more details.
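For example, something along these lines (assuming you can find the Tomcat JVM's PID; `jps` can help with that):

```shell
# List running JVMs to find the Tomcat PID (Tomcat usually shows up as Bootstrap)
jps -l

# Option 1: request a thread dump via jcmd and save it to a file
jcmd <pid> Thread.print > threaddump-$(date +%s).txt

# Option 2: send SIGQUIT; the dump is written to the JVM's stdout (catalina.out for Tomcat)
kill -3 <pid>
```

Capturing a few dumps in a row (a few seconds apart) makes it easier to see which threads are stuck versus just busy.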

This looks definitely close to what is described in the Okta blogpost, but as far as I understand there isn't any clue about:

  • which agent is actually used; there are multiple vendors in that space and we don't even know whether it's the Elastic one or not
  • which part of the instrumentation they had to disable

So here I think the best way forward is to investigate what is actually happening in your environment and applications, and for that capturing a thread dump is the best option.

From that, we should at least be able to identify which part of the agent instrumentation is causing this, and we can then use the disable_instrumentations configuration option to implement a work-around similar to the Okta blog post.
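As a sketch (the instrumentation group names below are only examples; the thread dump and debug log would tell us which groups actually matter), the option can be set either as a JVM system property or in elasticapm.properties:

```shell
# As a JVM system property (group names are illustrative examples):
-Delastic.apm.disable_instrumentations=jdbc,quartz

# Or equivalently in elasticapm.properties:
# disable_instrumentations=jdbc,quartz
```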

While analyzing the Tomcat logs, we saw some correlation between an error log and the RedefineClasses VM operation:

Not every RedefineClasses log entry is followed by this "Error sending data to APM server" error, but every "Error sending data to APM server" error has a "RedefineClasses" row before it.

It is hard to catch the "RedefineClasses" moment, because it happens sporadically; we can only realize it afterward. Do you have an idea how to catch that moment "automatically" / programmatically?

The exception shown here comes from the response of the APM server (or any intermediate proxy), and it happens in a dedicated thread, elastic-apm-server-reporter, which is not an application thread, so it should not impact the whole JVM. Data serialization and sending are asynchronous and do not impact the application.

  • could you share your agent configuration (with the sensitive items redacted)?
  • are there any HTTP proxies involved between the application/agent and the apm-server? Sending data to the apm-server is very likely not the cause of the excessive time spent in RedefineClasses.
  • from the output, I understand that you have roughly ~1500 threads in the JVM; is that usual and expected in your environment? Such a thread count seems high and could explain why RedefineClasses takes longer than expected.

I need to research a bit on my side how we could investigate this further.

Capturing the agent log at debug level would likely give us a few hints about which classes are currently being instrumented; if you could capture it using this procedure and send the log file in a private message, that would likely be helpful.
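In short, that procedure amounts to setting the agent's log_level option (log_file is optional; the path below is just an example):

```shell
# Agent debug logging via JVM system properties:
-Delastic.apm.log_level=debug
-Delastic.apm.log_file=/var/log/tomcat/apm-agent-debug.log
```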

When the agent (and the JVM) starts, it also logs the effective agent configuration, so the debug log will include that as well.

Here the HTTP communication issue is very likely more a symptom of the JVM stop-the-world pause rather than the cause of it.
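Regarding catching the RedefineClasses moment automatically: one option (a sketch on my side, assuming JDK 11+ unified logging; the exact log line format varies between JDK versions) is to have the JVM log safepoint operations to a file and watch that file:

```shell
# In CATALINA_OPTS / JAVA_OPTS: log safepoint operations with timestamps (JDK 11+)
-Xlog:safepoint*=info:file=/var/log/tomcat/safepoint.log:time,uptime

# Then alert whenever a RedefineClasses safepoint shows up, e.g.:
# tail -F /var/log/tomcat/safepoint.log | grep --line-buffered 'RedefineClasses'
```

The "At safepoint" / pause duration fields in those log lines would also tell us exactly how long each RedefineClasses operation stopped the JVM.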

@Sylvain_Juge: I have sent you a private message with the logs.

1500 threads are "normal" for our environment, because Tomcat hosts about 60 contexts/applications and each starts 5-10 different threads (Quartz for scheduling background tasks and LaunchDarkly for feature flags).

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.