JVM crashed after upgrade to apm java agent 1.19

Kibana version: 7.10
Elasticsearch version: 7.10
APM Server version: 7.10
APM Agent language and version: Java APM 1.19
Fresh install or upgraded from other version? Upgraded from Java APM agent 1.15 to 1.19

Problem:
I upgraded an Elastic Stack + APM Server deployment (used for testing) to the latest versions across the board: 7.10 for the stack and 1.19 for the Java agent. The issue lies specifically with the Elastic APM Java agent 1.19 upgrade. We upgraded from 1.15 to 1.19 and started to see JVMs crashing after 5+ minutes.

Setup:

  • os: CentOS Linux release 7.8.2003
  • java app: tomcat 7.0.76-12
  • java version: jdk1.7.0_80
  • java apm version: 1.19

Provide logs and/or server output (if relevant):
As far as I can tell, there are no logs that describe what is happening, even with org.apache / apm logging turned up to trace. I do have a crash report that I can provide. Any assistance is appreciated. This issue did not occur on APM Java agent version 1.15.

Hi @jcotter91, welcome to our forum.

I'm really sorry to hear that upgrading the agent made your application crash.

Can you send us your crash report here? (Please redact any sensitive JVM parameters it could contain.)

Here you go.

Could you please also share your configuration?

From the crash report I can infer that you might have set log_level to TRACE; if so, please set it to DEBUG or INFO.

Another tip would be to move to a more recent update release of Java 7 if that's feasible.

Also, the crash report indicates that it crashed after about 30 minutes, as we can see at the bottom of the file:

elapsed time: 1772 seconds

You mentioned it crashed within a few minutes after updating the agent. Does that mean that more than one JVM crashed with a similar error? If so, the other crash reports would help us see whether it's exactly the same issue.
Those crashes tend to be hard to reproduce and diagnose, so the more info we have about them, the better.
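
When several crash reports are available, the elapsed-time line can be pulled out of each one to compare how long each JVM ran before crashing. A minimal sketch, assuming the reports contain a line of the form quoted above; the function name elapsed_minutes is made up for illustration.

```shell
# Reads a crash report on stdin and prints the JVM uptime in minutes,
# taken from the "elapsed time: N seconds" line near the end of the file.
elapsed_minutes() {
    awk '/^elapsed time:/ { printf "%.1f\n", $3 / 60 }'
}

# Example with the value quoted above:
printf 'elapsed time: 1772 seconds\n' | elapsed_minutes   # prints 29.5
```

To run it over a real report, redirect the file into the function, e.g. elapsed_minutes < /path/to/crash-report.log (the path is a placeholder).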

Could you please also share your configuration?

Are you just looking for the JVM startup arguments? I am not pointing to external configuration for the apm agent. Below is what I am setting for configuration. I have also tried removing most of the -D arguments to see if it was a specific one and the JVM still ends up crashing.

/usr/java/jdk1.7.0_80/bin/java -server -Xmx3072m -Xms2048m -XX:MaxPermSize=256m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dsetup.restAccessLogDir=/opt/tomcat2/logs -DadvancedLogging=true -Dfr.json.interposer.disabled=true -Dfr.mixin.http.client.entity=false -Dfr.mixin.http.client.request=false -Dfr.mixin.http.client.response=false -javaagent:/opt/setup/elastic-apm-agent.jar -Delastic.apm.application_packages=com.setup -Delastic.apm.server_urls=http://logapm1.new.setup.com:8200,http://logapm2.new.setup.com:8200 -Delastic.apm.environment=189 -Delastic.apm.capture_body=all -Delastic.apm.capture_headers=true -Delastic.apm.service_name=web -Delastic.apm.service_node_name=tomcat2 -Delastic.apm.global_labels=server_type=web,hostname=web2.new.setup.com,location=setup,datacenter=new,swimlane=isl -Delastic.apm.span_min_duration=0ms -DstatsLoggingPrefix=isl -DstatsLoggingHost=localhost -DstatsLoggingPort=8125 -classpath /usr/share/tomcat/bin/bootstrap.jar:/usr/share/tomcat/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar -Dcatalina.base=/opt/tomcat2 -Dcatalina.home=/usr/share/tomcat -Djava.endorsed.dirs=/usr/share/tomcat/lib/endorsed -Djava.io.tmpdir=/opt/tomcat2/temp -Djava.util.logging.config.file=/usr/share/tomcat/lib/log4j.xml -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager org.apache.catalina.startup.Bootstrap start
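
For a configuration passed entirely through command-line arguments like the one above, the agent-related settings can be separated from the rest before sharing them. A rough sketch, assuming no argument contains spaces; apm_settings is a hypothetical helper name, not part of the agent.

```shell
# Reads a JVM command line on stdin and prints only the agent-related
# arguments (-javaagent:... and -Delastic.apm.*), one per line.
apm_settings() {
    tr -s ' ' '\n' | grep -E '^-(javaagent:|Delastic\.apm\.)'
}

# Example with a shortened version of the command line above:
echo 'java -server -javaagent:/opt/setup/elastic-apm-agent.jar -Delastic.apm.service_name=web -Dcatalina.base=/opt/tomcat2' \
    | apm_settings
```

This keeps unrelated -D options out of the output, which makes it easier to redact anything sensitive before posting.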

From the crash report I can infer that you might have set log_level to TRACE; if so, please set it to DEBUG or INFO.

I only turned on the TRACE level for the APM agent to see if there was additional information I could debug with. It was off originally, then set to DEBUG, and then TRACE. I can set this to DEBUG. Would you like to look at those logs?

Another tip would be to move to a more recent update release of Java 7 if that's feasible.

I cannot upgrade this application's Java version at this time. I do have Java 8 (jdk1.8.0_51) applications running with APM agent 1.19 that do not seem to be experiencing this issue.

https://www.elastic.co/guide/en/apm/agent/java/master/release-notes-1.x.html#release-notes-1.18.0.rc1

As early versions of Java 7 and Java 8 have unreliable support for invokedynamic, we now require a minimum update level of 60 for Java 7 (7u60+) in addition to the existing minimum update level of 40 for Java 8 (8u40+).

Per the documentation, it does seem like I should be able to use this version, at least on 1.18.0RC1. I have not tested this apm version yet.
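
The minimum update levels quoted from the release notes can be checked against a JDK version string with a few lines of shell. This is only a sketch: it assumes the legacy version format (such as 1.7.0_80), and meets_min_update is a made-up name.

```shell
# Succeeds if a legacy-format Java version string (e.g. 1.7.0_80) meets the
# minimum update level quoted in the release notes: 7u60+ or 8u40+.
# Returns 2 for version strings this sketch does not understand.
meets_min_update() {
    case $1 in
        1.7.0_*) min=60 ;;
        1.8.0_*) min=40 ;;
        *) return 2 ;;
    esac
    update=${1##*_}          # text after the last underscore, e.g. "80"
    update=${update%%-*}     # drop any suffix such as "-b15"
    [ "$update" -ge "$min" ]
}

meets_min_update 1.7.0_80 && echo "1.7.0_80: meets the documented minimum"
meets_min_update 1.8.0_51 && echo "1.8.0_51: meets the documented minimum"
```

Both JDKs mentioned here pass the check, which only confirms the documented requirement and says nothing about other JVM bugs.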

Also, the crash report indicates that it crashed after about 30 minutes, as we can see at the bottom of the file:
elapsed time: 1772 seconds
You mentioned it crashed within a few minutes after updating the agent. Does that mean that more than one JVM crashed with a similar error? If so, the other crash reports would help us see whether it's exactly the same issue.
Those crashes tend to be hard to reproduce and diagnose, so the more info we have about them, the better.

Yes, there has been more than one crash, on different JVMs running different code. The 5+ minutes is probably more like 15-30 minutes, as you see above. I can get the crash dump from a JVM where I believe it crashed faster than 30 minutes. I also noticed that all the JVMs where this crashed are on jdk1.7.0_80. I am also noticing that other JVMs I have running jdk1.7.0_80 hit 100% CPU on the available cores rather than crashing, which causes socket timeouts.

Thanks for the crash reports.

Many of them indicate that the crash occurred shortly after a JIT compilation event of our co.elastic.apm.agent.servlet.ServletApiAdvice#onEnterServletService method. Moreover, the crash occurred while the executing method stack frame stored a co.elastic.apm.agent.impl.transaction.Transaction object in its RAX register, which typically stores the return values in x86 architectures. This fits the onEnterServletService return type, so although it may be coincidental, let's assume that the crash occurs due to faulty native code produced by this last JIT event.

In order to test that, I would ask you to disable JIT compilation for this method specifically, relying on the CompileCommand option.
Please try the following:

  1. run java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version using the same Java that runs the application and verify the printed output contains CompileCommand and LogCompilation, meaning these options are available.
  2. append the following to the command line that starts the application: -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=exclude,co/elastic/apm/agent/servlet/ServletApiAdvice.onEnterServletService -XX:+PrintCompilation -XX:+LogCompilation -XX:LogFile=<path/to/compilation.log> (adjust the compilation log path to a real path) and try to reproduce.
  3. if the crash is reproduced with this addition, please provide the compilation.log file and the crash report.
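
The steps above could be scripted roughly as follows. This is only a sketch: flags_available and jit_exclude_opts are made-up helper names, and JAVA_BIN and the log path in the usage comments are placeholders to adjust.

```shell
# Step 1: reads `java -XX:+PrintFlagsFinal` output on stdin and succeeds only
# if every flag name given as an argument appears as a whole word.
flags_available() {
    output=$(cat)
    for flag in "$@"; do
        printf '%s\n' "$output" | grep -qw -- "$flag" || return 1
    done
}

# Step 2: prints the extra JVM options, writing the compilation log to the
# path given as the first argument.
jit_exclude_opts() {
    printf '%s' "-XX:+UnlockDiagnosticVMOptions" \
        " -XX:CompileCommand=exclude,co/elastic/apm/agent/servlet/ServletApiAdvice.onEnterServletService" \
        " -XX:+PrintCompilation -XX:+LogCompilation -XX:LogFile=$1"
}

# Usage (not executed here):
# "$JAVA_BIN" -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version 2>&1 \
#     | flags_available CompileCommand LogCompilation && echo "both options available"
# EXTRA_OPTS=$(jit_exclude_opts /path/to/compilation.log)
```

How EXTRA_OPTS reaches the JVM depends on how the application is started; with a stock Tomcat launched through catalina.sh it would typically be appended to CATALINA_OPTS, but that is an assumption about the setup.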

There are a couple of crash reports that differ, but let's start by ignoring those.
Looking forward to hearing back.
Thanks again!

CompileCommand and LogCompilation are both available.

I have restarted a JVM that crashes with the ARGs provided. Will report back with compilation.log and any crash reports in an hour or so.

So after about 4 hours the JVM still has not crashed after setting the above ARGs for the tomcat instance.

@jcotter91 this is great news!

Please let us know once you are confident this setting is a sufficient resolution, so we can document it for other users of the same JVM version.

If you do get a crash, please provide the /opt/tomcat1/logs/compilation.log and crash report so we can analyse further.

Lastly, you can try a more recent update of Java 7 (e.g. Zulu OpenJDK), without the compilation exclusion arg. This JIT error should be fixed in later releases. If you do try that, we will be happy to hear about it.

Do I lose any apm data by excluding this piece?
-XX:CompileCommand=exclude,co/elastic/apm/agent/servlet/ServletApiAdvice.onEnterServletService

No, not at all. This only prevents JIT compilation for a single method, as it seems to produce buggy native code in the JVM you are using. The implication is lack of optimisation for this method, which I believe is far too small for you to be able to measure.

If that is the case we can close this. I have not seen any crashes since this change. If I have any follow up around this I will reopen / create a new ticket if I find something different.

Thank you for your assistance!

Hi @jcotter91,

We have managed to reproduce a very similar issue, reported in https://github.com/elastic/apm-agent-java/issues/1583.

By any chance, do you have a compilation.log file captured without the -XX:CompileCommand option? If you do, it would help us identify the underlying cause so we can come up with a better workaround.

I don't have one at the moment, but could get you one in the next day or two.

@jcotter91 we are trying to find the most general form of C2 compilation exclusion we need to set in order for this issue and similar ones on Java 7 to be resolved.
Please try to replace the former CompileCommand with a new one: -XX:CompileCommand=exclude,java.lang.invoke.LambdaForm*::* and see if it resolves the issue as well. Your input on that will be invaluable for us.
Thanks!
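
Swapping the exclusion in an existing option string could look like the sketch below. swap_exclude is a hypothetical helper; it assumes the options are held in a single space-separated string with exactly one CompileCommand=exclude entry.

```shell
# Replaces the first -XX:CompileCommand=exclude,... entry in the option string
# passed as the first argument with the broader LambdaForm exclusion.
swap_exclude() {
    printf '%s\n' "$1" \
        | sed 's#-XX:CompileCommand=exclude,[^ ]*#-XX:CompileCommand=exclude,java.lang.invoke.LambdaForm*::*#'
}

old='-XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=exclude,co/elastic/apm/agent/servlet/ServletApiAdvice.onEnterServletService -XX:+PrintCompilation'
swap_exclude "$old"
```

Because the new pattern contains *, keep it quoted wherever the shell could otherwise try to expand it as a filename glob.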
