Long "stop-the-world" pauses in Tomcat, monitored by apm-java-agent

We have an ELK stack with APM Server in the cloud: cloud.elastic.co (APM Server 8.9.1)
We monitor our production application servers (several Tomcats behind HAProxy) with the apm-java-agent (v1.42), and we (and unfortunately our customers too) are seeing 30-40s application stops (stop-the-world pauses). We have JVM logs like this:

We have figured out that this class redefinition is happening through instrumentation and is somehow coming from the agent (directly).

How can we figure out why this happens? Can we change some settings to fine-tune the apm-java-agent (or the JVM/Tomcat)?
For now we have disabled the apm-java-agents, but we would like to keep using the agent because it provides us with a lot of important information.

The agent source code has Maven settings for the maven-shade-plugin: https://github.com/elastic/apm-agent-java/blob/e28ad36dfb8960760314af613d624e1a00ac4cb2/elastic-apm-agent/pom.xml#L276
Can-Redefine-Classes: true, Can-Transform-Classes: true

Is that something that can be switched off?

Hi @Bela_Borbely , thanks for reporting this.

Can you capture a thread dump when this issue happens?
This should help narrow down which classes are instrumented or potential causes.

The maven-shade-plugin settings are not related to what you are experiencing here; they just give the agent the ability to redefine classes. The problem here is rather "why does the agent make the JVM spend so much time redefining classes?".

Thank you for your answer!
So the "redefining classes" operation is necessary for the agent, and we cannot just switch it off?
I don't know how to capture a thread dump at the moment the issue happens. It occurs sporadically.

We have found a quite similar situation described by Okta Developers:

Yes, the "redefining classes" operation is necessary for the agent; it's not something we can switch off.

In order to capture thread dumps, you can use jcmd or send the SIGQUIT signal; see the docs for more details.
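For example, something along these lines (assuming you can find the Tomcat JVM's PID; `jps` can help with that):

```shell
# List running JVMs to find the Tomcat PID (Tomcat usually shows up as Bootstrap)
jps -l

# Option 1: request a thread dump via jcmd and save it to a file
jcmd <pid> Thread.print > threaddump-$(date +%s).txt

# Option 2: send SIGQUIT; the dump is written to the JVM's stdout (catalina.out for Tomcat)
kill -3 <pid>
```

Capturing a few dumps in a row (a few seconds apart) makes it easier to see which threads are stuck versus just busy.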

This looks definitely close to what is described in the Okta blogpost, but as far as I understand there isn't any clue about:

  • which agent is actually used; there are multiple vendors in that space and we don't even know whether it's the Elastic one or not
  • which part of the instrumentation they had to disable

So here I think the best way forward is to investigate what is actually happening in your environment and applications, and for that capturing a thread dump is the best option.

From that, we should at least be able to identify which part of the agent instrumentation is causing this, and we can then use the disable_instrumentations configuration option to implement a work-around similar to the Okta blog post.
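As a sketch (the instrumentation group names below are only examples; the thread dump and debug log would tell us which groups actually matter), the option can be set either as a JVM system property or in elasticapm.properties:

```shell
# As a JVM system property (group names are illustrative examples):
-Delastic.apm.disable_instrumentations=jdbc,quartz

# Or equivalently in elasticapm.properties:
# disable_instrumentations=jdbc,quartz
```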

While analyzing the Tomcat logs, we saw some correlation between an error log and the RedefineClasses VM operation:

Not every RedefineClasses log entry is followed by this "Error sending data to APM server" error, but every "Error sending data to APM server" error has a "RedefineClasses" row before it.

It is hard to catch the "RedefineClasses" moment, because it happens sporadically; we can only realize it afterward. Do you have an idea how to catch that moment "automatically" / programmatically?

The exception shown here comes from the response of the APM server (or any intermediate proxy), and it happens in a dedicated thread, elastic-apm-server-reporter, which is not an application thread, so it should not impact the whole JVM. Data serialization and sending are asynchronous and do not impact the application.

  • could you share your agent configuration (with the sensitive items redacted)?
  • are there any HTTP proxies involved between the application/agent and the apm-server? Sending data to the apm-server is very likely not the cause of the excessive time spent in RedefineClasses.
  • from the output, I understand that you have roughly ~1500 threads in the JVM; is that usual and expected in your environment? Such a thread count seems high and could explain why RedefineClasses takes longer than expected.

I need to research a bit on my side how we could investigate this further.

Capturing the agent log at debug level would likely give us a few hints about which classes are currently being instrumented; if you could capture it using this procedure and send the log file in a private message, that would likely be helpful.
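In short, that procedure amounts to setting the agent's log_level option (log_file is optional; the path below is just an example):

```shell
# Agent debug logging via JVM system properties:
-Delastic.apm.log_level=debug
-Delastic.apm.log_file=/var/log/tomcat/apm-agent-debug.log
```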

When the agent (and the JVM) starts, it also logs the effective agent configuration, so the debug log will include that as well.

Here the HTTP communication issue is very likely more a symptom of the JVM stop-the-world pause rather than the cause of it.
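Regarding catching the RedefineClasses moment automatically: one option (a sketch on my side, assuming JDK 11+ unified logging; the exact log line format varies between JDK versions) is to have the JVM log safepoint operations to a file and watch that file:

```shell
# In CATALINA_OPTS / JAVA_OPTS: log safepoint operations with timestamps (JDK 11+)
-Xlog:safepoint*=info:file=/var/log/tomcat/safepoint.log:time,uptime

# Then alert whenever a RedefineClasses safepoint shows up, e.g.:
# tail -F /var/log/tomcat/safepoint.log | grep --line-buffered 'RedefineClasses'
```

The "At safepoint" / pause duration fields in those log lines would also tell us exactly how long each RedefineClasses operation stopped the JVM.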

@Sylvain_Juge: I have sent you a private message with the logs.

1500 threads are "normal" for our environment, because Tomcat hosts about 60 contexts/applications and each starts 5-10 different threads (Quartz for scheduling background tasks and LaunchDarkly for feature flags).

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.