JVM crash with JRE 7 and profiling_inferred_spans_enabled

Hello. We're currently testing out Elastic APM in our Java-based test environment for an application my team supports. One of the application components runs in a Java 7 environment. We initially encountered issue 1583 and applied the suggested -XX:CompileCommand flag to work around it.

However, whenever we set profiling_inferred_spans_enabled to true, the JVM crashes shortly afterward. We are interested in collecting the extra span information provided by profiling_inferred_spans_enabled, and we are hoping there is a way to work through this. Thank you.

OS Version: CentOS 6.10
Java version: Oracle JRE 7.0_80-b15
APM Agent version: apm-java-agent 1.21.0

The apm-agent attaches to the apache tomcat process via the -javaagent flag. Default settings.
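For context, a minimal sketch of what that attachment might look like in Tomcat's setenv.sh. The jar path and service name below are placeholders for illustration, not values from the original post:

```shell
# setenv.sh - sketch only; the agent jar path and service name are hypothetical
# Attach the Elastic APM Java agent to the Tomcat JVM via -javaagent
CATALINA_OPTS="$CATALINA_OPTS -javaagent:/opt/elastic/elastic-apm-agent-1.21.0.jar"
# The setting under discussion, enabled via system property
CATALINA_OPTS="$CATALINA_OPTS -Delastic.apm.profiling_inferred_spans_enabled=true"
# Hypothetical service name for illustration
CATALINA_OPTS="$CATALINA_OPTS -Delastic.apm.service_name=my-service"
export CATALINA_OPTS
```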


top of the hs_err_pid log:

# A fatal error has been detected by the Java Runtime Environment:
#  SIGSEGV (0xb) at pc=0x00007ffa751cb08b, pid=13985, tid=140712926508800
# JRE version: Java(TM) SE Runtime Environment (7.0_80-b15) (build 1.7.0_80-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.80-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6ec08b]  JvmtiEnvBase::get_stack_trace(JavaThread*, int, int, _jvmtiFrameInfo*, int*)+0x21b
# Core dump written. Default location: /apps/tomcat8/sbtool/bin/core or core.13985
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp

---------------  T H R E A D  ---------------

Current thread (0x00007ffa7042f800):  JavaThread "Unknown thread" [_thread_blocked, id=14101, stack(0x00007ffa47eff000,0x00007ffa48000000)]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x0000000000f60115

RAX=0x0000000000f600f5, RBX=0x00007ffa7012ade0, RCX=0x0000003e2b8182a0, RDX=0x00007ffa7010f4f8
RSP=0x00007ffa47ff8c80, RBP=0x00007ffa47ff9300, RSI=0x00007ffa7010f4f0, RDI=0x0000000000000000
R8 =0x0000000000000000, R9 =0x0000000000000800, R10=0x0000000000000000, R11=0x0000000000000246
R12=0x00000000f4dd59a8, R13=0x0000000002782240, R14=0x00007ffa7010f4f0, R15=0x00007ffa58664d38
RIP=0x00007ffa751cb08b, EFLAGS=0x0000000000010246, CSGSFS=0x0000000000000033, ERR=0x0000000000000004

Hi @Charles_Porter,

I am sorry for the inconvenience, and thanks for reporting this issue.
Do you think you could send us the full crash report (making sure that any sensitive environment variables or JVM command-line parameters are removed)?

Also, how frequently do the crashes occur?

In past JVM crashes related to JvmtiEnvBase::get_stack_trace, we managed to work around these stability issues on some JVMs by using the async_profiler_safe_mode=63 configuration parameter.
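A sketch of the two usual ways to apply such an agent setting, following the agent's standard elastic.apm.* system-property and elasticapm.properties conventions (the value 63 is the one suggested above):

```shell
# Option 1: JVM system properties (sketch; append to your existing JVM options)
JAVA_OPTS="$JAVA_OPTS -Delastic.apm.profiling_inferred_spans_enabled=true"
JAVA_OPTS="$JAVA_OPTS -Delastic.apm.async_profiler_safe_mode=63"

# Option 2: elasticapm.properties placed next to the agent jar
# profiling_inferred_spans_enabled=true
# async_profiler_safe_mode=63
```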

This (as yet undocumented) configuration option makes async-profiler skip collecting some stack traces for extra safety. Please try it and tell us if it makes any difference. Depending on the result, some trial and error might be required to properly tune the value and identify what is causing this within async-profiler.
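One way to structure that trial and error: 63 sets the six lowest bits of a bitmask, so clearing one bit at a time produces candidate values that each re-enable a single safety option. The sketch below only does the arithmetic; the meaning of each bit is defined by async-profiler and is not documented here:

```java
public class SafeModeBits {
    public static void main(String[] args) {
        int safeMode = 63; // 0b111111, the value suggested above
        System.out.println("binary: " + Integer.toBinaryString(safeMode)); // 111111
        // Clear one bit at a time to get candidate values for narrowing
        // down which safety option masks the crash.
        for (int bit = 0; bit < 6; bit++) {
            int candidate = safeMode & ~(1 << bit);
            System.out.println("without bit " + bit + ": " + candidate);
        }
    }
}
```

Running each candidate value until the crash reappears would point at the single bit (and thus the single async-profiler mechanism) involved.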

Also, it's completely unrelated, but there is a small typo in the config you posted here: verify_server_sert should probably be replaced by verify_server_cert.

Hi Sylvain_Juge,

Thank you for responding. Using async_profiler_safe_mode=63 seems to have improved things substantially. No JVM crashes yet.

I've submitted the full hs_err_pid.log through my Elastic Sales contact.

Also, thanks for catching the typo. My eyes are clearly not what they used to be. :upside_down_face:

Thank you.

Thanks for the update @Charles_Porter

When you get the chance, please try out this bugfix without the async_profiler_safe_mode setting and see if the problem is resolved. This snapshot contains the proposed fix for Async Profiler.


Thank you, @Eyal_Koren

The bugfix appears to be successful. After installing it, I disabled async_profiler_safe_mode and restarted the application. Simulated workloads have not produced any JVM crashes.

Thank you very much for helping.

Awesome! Thanks for reporting back!

I have also been running load tests for the past ~48 hours with that fix, on a setup that previously reproduced the issue you reported, with quite intensive load on async-profiler (very frequent sampling), and it looks good.

For now, you can continue using this snapshot. We will make sure to include this fix in our next release.