Kibana version: 7.10
Elasticsearch version: 7.10
APM Server version: 7.10
APM Agent language and version: Java APM 1.19
Fresh install or upgraded from other version? Upgraded from Java APM agent 1.15 to 1.19
Problem:
I upgraded an Elastic Stack + APM Server (used for testing) to the latest versions across the board: 7.10 for the stack and 1.19 for the Java agent. The issue lies specifically with the Elastic Java APM agent upgrade. We upgraded from 1.15 to 1.19 and started to see JVMs crashing after 5+ minutes.
Setup:
os: CentOS Linux release 7.8.2003
java app: tomcat 7.0.76-12
java version: jdk1.7.0_80
java apm version: 1.19
Provide logs and/or server output (if relevant):
As far as I can tell, no logs describe what is happening, even with org.apache / apm logging turned up to trace. I do have a crash report that I can provide. Any assistance is appreciated. This issue did not occur on Java APM agent version 1.15.
Also, the crash report indicates that it crashed after about 30 minutes, as we can see at the bottom of the file:
elapsed time: 1772 seconds
You mentioned it crashed within a few minutes after updating the agent. Does that mean that more than one JVM crashed with a similar error? If yes, then crash reports would help to see if it's exactly the same issue here.
Those crashes tend to be hard to reproduce and diagnose, thus the more info we have about it, the better.
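When comparing several crash reports, it can help to pull the uptime out of each one in the same way. The following is only a sketch: crash_uptime is a hypothetical helper, and it assumes the report contains an "elapsed time: N seconds" line like the one quoted above.

```shell
# Hypothetical helper: read a HotSpot crash report on stdin and print the
# uptime from its "elapsed time: N seconds" line as minutes and seconds.
crash_uptime() {
  # Keep only the number of seconds; use the last matching line if several.
  secs=$(sed -n 's/^elapsed time: \([0-9][0-9]*\) seconds.*/\1/p' | tail -n 1)
  [ -n "$secs" ] || return 1   # no such line in the input
  echo "$((secs / 60))m $((secs % 60))s"
}

# The value quoted above: 1772 seconds is just under 30 minutes.
printf 'elapsed time: 1772 seconds\n' | crash_uptime   # prints "29m 32s"
```

To use it on a real report, redirect the crash report file into the function, for example crash_uptime < path/to/crash-report (the path is a placeholder).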
Are you just looking for the JVM startup arguments? I am not pointing to external configuration for the apm agent. Below is what I am setting for configuration. I have also tried removing most of the -D arguments to see if it was a specific one and the JVM still ends up crashing.
From the crash report I can infer that you might have set log_level to TRACE. If so, please set it to DEBUG or INFO.
I only turned on the TRACE level on the APM to see if there was additional information I could debug with. It was off originally, then set to DEBUG, and then TRACE. I can set this to DEBUG. Would you like to look at those logs?
Another tip would be to move to a more recent update of Java 7 if that's feasible.
I cannot upgrade this application's Java version at this time. I do have applications on Java 8 (jdk1.8.0_51) running with APM agent version 1.19 that do not seem to be experiencing this issue.
https://www.elastic.co/guide/en/apm/agent/java/master/release-notes-1.x.html#release-notes-1.18.0.rc1
As early versions of Java 7 and Java 8 have unreliable support for invokedynamic, we now require a minimum update level of 60 for Java 7 (7u60+) in addition to the existing minimum update level of 40 for Java 8 (8u40+).
Per the documentation, it does seem like I should be able to use this version, at least on 1.18.0RC1. I have not tested this apm version yet.
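The minimum update levels quoted from the release notes can be checked mechanically. This is a sketch only: meets_min_update is a hypothetical helper, and it assumes the legacy version format (for example 1.7.0_80) that Java 7 and Java 8 report.

```shell
# Hypothetical helper: succeed if a version string such as "1.7.0_80"
# meets the minimum update level quoted above (7u60+ or 8u40+).
meets_min_update() {
  version="$1"
  case "$version" in
    1.7.0_*) min=60 ;;
    1.8.0_*) min=40 ;;
    *) return 2 ;;              # format not covered by this sketch
  esac
  update="${version##*_}"       # text after the last underscore, e.g. "80"
  update="${update%%[!0-9]*}"   # drop any non-numeric suffix such as "-b15"
  [ -n "$update" ] && [ "$update" -ge "$min" ]
}

# The two JDKs mentioned in this thread.
meets_min_update "1.7.0_80" && echo "1.7.0_80: meets minimum"
meets_min_update "1.8.0_51" && echo "1.8.0_51: meets minimum"
```

Both JDKs mentioned here pass the check, which matches the reading that jdk1.7.0_80 is within the documented range and the crash is a separate problem. The version string of the JVM that actually runs the application is printed on the first line of java -version.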
Yes, there has been more than one crash, on different JVMs running different code. The 5+ minutes is probably closer to 15-30 minutes, as you see above. I can get the crash dump from a JVM where I believe it crashed faster than 30 minutes. I also noticed that all the JVMs that crashed are on jdk1.7.0_80. I am also seeing other JVMs running jdk1.7.0_80 hit 100% of available CPU rather than crashing, which causes socket timeouts.
Many of them indicate that the crash occurred shortly after a JIT compilation event of our co.elastic.apm.agent.servlet.ServletApiAdvice#onEnterServletService method. Moreover, the crash occurred while the executing method stack frame stored a co.elastic.apm.agent.impl.transaction.Transaction object in its RAX register, which typically stores the return values in x86 architectures. This fits the onEnterServletService return type, so although it may be coincidental, let's assume that the crash occurs due to faulty native code produced by this last JIT event.
In order to test that, I would ask you to disable JIT compilation for this method specifically, relying on the CompileCommand option.
Please try the following:
1. Run java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version using the same Java that runs the application and verify the printed output contains CompileCommand and LogCompilation, meaning these options are available.
2. Append the following to the command line that starts the application: -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=exclude,co/elastic/apm/agent/servlet/ServletApiAdvice.onEnterServletService -XX:+PrintCompilation -XX:+LogCompilation -XX:LogFile=<path/to/compilation.log> (adjust the compilation log path to a real path) and try to reproduce.
3. If the crash is reproduced with this addition, please provide the compilation.log file and the crash report.
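The steps above could be wired into Tomcat roughly as follows. This is a sketch under assumptions, not a verified setup: it assumes Tomcat is started through catalina.sh (which sources bin/setenv.sh when that file exists), and the default log path is only an example that must be replaced with a writable location.

```shell
# Sketch of bin/setenv.sh, sourced by catalina.sh at startup (assumption:
# this Tomcat is started through catalina.sh).
#
# Step 1, run by hand with the same java binary that runs Tomcat:
#   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version \
#     | grep -E 'CompileCommand|LogCompilation'

# Assumed example path; adjust to a real, writable location.
COMPILATION_LOG="${COMPILATION_LOG:-/opt/tomcat1/logs/compilation.log}"

# Step 2: the options from the post above, built up one piece at a time.
JIT_DIAG_OPTS="-XX:+UnlockDiagnosticVMOptions"
JIT_DIAG_OPTS="$JIT_DIAG_OPTS -XX:CompileCommand=exclude,co/elastic/apm/agent/servlet/ServletApiAdvice.onEnterServletService"
JIT_DIAG_OPTS="$JIT_DIAG_OPTS -XX:+PrintCompilation -XX:+LogCompilation"
JIT_DIAG_OPTS="$JIT_DIAG_OPTS -XX:LogFile=$COMPILATION_LOG"

# Append to whatever CATALINA_OPTS already holds (for example the agent flags).
CATALINA_OPTS="${CATALINA_OPTS:-} $JIT_DIAG_OPTS"
export CATALINA_OPTS
```

If the JVM is started some other way, the same options can be appended to whatever variable or file holds its startup arguments.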
There are a couple of crash reports that differ, but let's start with ignoring those.
Looking forward to hearing back.
Thanks again!
Please let us know once you are confident this setting is a sufficient resolution, so we can document for other users of the same JVM version.
If you do get a crash, please provide the /opt/tomcat1/logs/compilation.log and crash report so we can analyse further.
Lastly, you can try a more recent update of Java 7 (e.g. Zulu OpenJDK), without the compilation exclusion arg. This JIT error should be fixed in later releases. If you do try that, we will be happy to hear about it.
No, not at all. This only prevents JIT compilation for a single method, as it seems to produce buggy native code in the JVM you are using. The implication is lack of optimisation for this method, which I believe is far too small for you to be able to measure.
If that is the case we can close this. I have not seen any crashes since this change. If I have any follow up around this I will reopen / create a new ticket if I find something different.
By chance, do you happen to have a compilation.log file from a run without the -XX:CompileCommand option? If you do, that would help us identify the underlying cause so we can offer a better workaround.
@jcotter91 we are trying to find the most general form of C2 compilation exclusion we need to set in order for this issue and similar ones on Java 7 to be resolved.
Please try to replace the former CompileCommand with a new one: -XX:CompileCommand=exclude,java.lang.invoke.LambdaForm*::* and see if it resolves the issue as well. Your input on that will be invaluable for us.
Thanks!
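When swapping one CompileCommand for another, it is easy to leave the old one in place or to lose the new one to shell quoting, since the pattern contains * characters. A small check along these lines can confirm what the JVM was actually started with. It is a sketch only: list_compile_commands is a hypothetical helper and the sample command line is made up for illustration.

```shell
# Hypothetical helper: print every -XX:CompileCommand option found in a
# JVM command line passed as a single string, one option per line.
list_compile_commands() {
  printf '%s\n' "$1" | tr ' ' '\n' | grep -e '^-XX:CompileCommand='
}

# Illustrative command line; single quotes keep the shell from expanding "*".
sample='java -XX:CompileCommand=exclude,java.lang.invoke.LambdaForm*::* -jar app.jar'
list_compile_commands "$sample"
```

For a running Tomcat, the string can come from ps -o args= -p <pid>, where <pid> is the process id of the JVM.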