We've been experiencing a SIGSEGV JVM crash on our production cluster that seems to originate from the APM Java agent. We haven't been able to establish a pattern for when it happens or what triggers it. The second crash happened a week after the first, and the third after the JVM had been running for two days.
We're running on an eclipse-temurin:17-alpine Docker image with APM agent version 1.36.
The error we see is:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f269de9b474, pid=1, tid=600
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.7+7 (17.0.7+7) (build 17.0.7+7)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.7+7 (17.0.7+7, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xe34474] ObjectSynchronizer::FastHashCode(Thread*, oopDesc*)+0x184
I think you did the right thing by opening an issue on the JVM as the problem seems to be within JVM code.
We can see from the stack trace that the Elastic APM agent calls java.lang.System.identityHashCode, which is implemented by the JVM in native code.
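For readers unfamiliar with the call, here is a minimal sketch of what java.lang.System.identityHashCode does from the Java side. The class and method names (IdentityHashDemo, Fixed, isStable) are made up for illustration and are not part of the agent. It only shows the documented behaviour of the method; it does not reproduce the crash.

```java
public class IdentityHashDemo {
    // Hypothetical class that overrides hashCode(), to show that
    // System.identityHashCode ignores the override.
    static final class Fixed {
        @Override
        public int hashCode() {
            return 42;
        }
    }

    // True when two consecutive calls return the same identity hash for one object.
    static boolean isStable(Object o) {
        return System.identityHashCode(o) == System.identityHashCode(o);
    }

    public static void main(String[] args) {
        Fixed f = new Fixed();
        // The overridden method returns the fixed value.
        System.out.println("hashCode():         " + f.hashCode());
        // The identity hash is supplied by the JVM and does not use the override.
        System.out.println("identityHashCode(): " + System.identityHashCode(f));
        System.out.println("stable:             " + isStable(f));
        // The identity hash of null is documented to be zero.
        System.out.println("null:               " + System.identityHashCode(null));
    }
}
```

Because the value comes from the JVM rather than from application code, a crash inside this call is consistent with the ObjectSynchronizer::FastHashCode frame in the report pointing at JVM internals.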
There are a few things that aren't visible in the crash report, though. Could you please provide more details on:
How is the agent set up? We don't see -javaagent in the JVM arguments.
Can you provide the agent configuration (without sensitive values)?
We attach programmatically using ElasticApmAttacher.attach();
This is our config:
2023-06-29 21:37:56,849 [Attach Listener] INFO co.elastic.apm.agent.configuration.StartupInfo - Starting Elastic APM 1.36.0 as ** (23.12.0) on Java 17.0.7 Runtime version: 17.0.7+7 VM version: 17.0.7+7 (Eclipse Adoptium) Linux 5.10.162+
2023-06-29 21:37:56,850 [Attach Listener] INFO co.elastic.apm.agent.configuration.StartupInfo - service_name: '**' (source: /configuration/elasticapm.properties)
2023-06-29 21:37:56,850 [Attach Listener] INFO co.elastic.apm.agent.configuration.StartupInfo - environment: 'NL-PROD' (source: /configuration/elasticapm.properties)
2023-06-29 21:37:56,850 [Attach Listener] INFO co.elastic.apm.agent.configuration.StartupInfo - disable_instrumentations: 'log-reformatting' (source: /configuration/elasticapm.properties)
2023-06-29 21:37:56,851 [Attach Listener] INFO co.elastic.apm.agent.configuration.StartupInfo - config_file: '/configuration/elasticapm.properties' (source: Environment Variables)
2023-06-29 21:37:56,851 [Attach Listener] INFO co.elastic.apm.agent.configuration.StartupInfo - plugins_dir: '/' (source: /configuration/elasticapm.properties)
2023-06-29 21:37:56,851 [Attach Listener] INFO co.elastic.apm.agent.configuration.StartupInfo - secret_token: 'XXXX' (source: /configuration/elasticapm.properties)
2023-06-29 21:37:56,851 [Attach Listener] INFO co.elastic.apm.agent.configuration.StartupInfo - server_urls: '**' (source: /configuration/elasticapm.properties)
2023-06-29 21:37:56,851 [Attach Listener] INFO co.elastic.apm.agent.configuration.StartupInfo - application_packages: 'com.h4h' (source: /configuration/elasticapm.properties)
We do not use the sampling profiler.
So far we've had three of these crashes. A week passed between the first and the second, and two days between the second and the third.
As you are not using the sampling profiler, the Java agent involves no native code that could interfere with the JVM code, so here we need to wait for feedback from the JVM team.
I would suggest collecting and keeping every crash report as it happens: if there is any correlation between them, having multiple instances would be very valuable.
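One way to compare saved crash reports is to pull out the line that follows the "# Problematic frame:" marker from each one. The sketch below is only an illustration: the class name CrashReportFrames and the idea of passing report paths as arguments are assumptions, and it relies solely on the report layout shown in the excerpt above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

public class CrashReportFrames {
    private static final String MARKER = "# Problematic frame:";

    // Returns the line that follows the marker, with its leading "#" removed,
    // or an empty Optional when the marker is not present.
    static Optional<String> problematicFrame(List<String> reportLines) {
        for (int i = 0; i < reportLines.size() - 1; i++) {
            if (reportLines.get(i).trim().equals(MARKER)) {
                String next = reportLines.get(i + 1).trim();
                return Optional.of(next.startsWith("#") ? next.substring(1).trim() : next);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // Each argument is assumed to be the path of one saved crash report.
        for (String arg : args) {
            List<String> lines = Files.readAllLines(Path.of(arg));
            System.out.println(arg + " -> "
                    + problematicFrame(lines).orElse("(no problematic frame found)"));
        }
    }
}
```

If every report prints the same frame, that points to a single failure mode; differing frames would be worth mentioning in the JVM issue.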
We've just had another crash on our production environment because of this issue. The weird thing is that it's still happening in JulConsoleHandlerPublishAdvice even though we've disabled the log-reformatting instrumentation. If there is any way to fully disable JulConsoleHandlerPublishAdvice, it would be very much appreciated; otherwise we might need to disable the whole agent for now.
It seems that, for some reason, disabling the log-reformatting group does not work. Disabling the whole logging group does result in: Not applying excluded instrumentation co.elastic.apm.agent.jul.reformatting.JulLogReformattingInstrumentation$ConsoleReformattingInstrumentation
I think the inconsistency in disabling this instrumentation was removed in version 1.37.0 by pull request #3064 in elastic/apm-agent-java, "set service name & version in ecs-logging" by SylvainJuge (you are using 1.36.0), so the best option is probably to upgrade to the latest version (1.39.0 as I'm writing this). I haven't checked this against the exact version you are using, but updating to the latest is usually a good idea.
Upgrading should not change the JVM crashes themselves, but at least you should be able to apply a better workaround.
Can confirm that with a newer version log-reformatting works, and with 1.36 jul-ecs works. It might be a good idea (for the future) to add a note to the documentation when something like this changes. I see it's listed as a breaking change in the changelog for 1.37, but the documentation for 1.x (Core configuration options | APM Java Agent Reference [1.x] | Elastic) gives no indication whatsoever that the naming of these groups differs between minor versions.
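As a rough sketch of the workaround confirmed above, the group name passed to disable_instrumentations can be chosen from the agent version. The mapping (jul-ecs before 1.37, log-reformatting from 1.37 on) is taken only from what this thread reports; the class LogReformattingConfig and its methods are hypothetical helpers, not part of the agent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LogReformattingConfig {
    // Picks the group name this thread reports as effective for the given
    // agent version string, for example "1.36.0" or "1.39.0".
    static String groupNameFor(String agentVersion) {
        String[] parts = agentVersion.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        boolean before137 = major < 1 || (major == 1 && minor < 37);
        return before137 ? "jul-ecs" : "log-reformatting";
    }

    // Builds a configuration map containing only the disable_instrumentations entry.
    static Map<String, String> configFor(String agentVersion) {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("disable_instrumentations", groupNameFor(agentVersion));
        return config;
    }

    public static void main(String[] args) {
        // Assumption: the resulting entry is copied into elasticapm.properties,
        // or handed to the attacher if your setup accepts a configuration map.
        System.out.println(configFor("1.36.0"));
        System.out.println(configFor("1.39.0"));
    }
}
```

Keeping the choice in one place makes it harder to miss a renamed group when upgrading across minor versions.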