JVM corrupted with java agent 1.19.0 - CentOS

Hi everyone! We had the APM java agent with Tomcat successfully running in test environments for a week but after I modified some configurations of the agent(see below), JVM was occasionally corrupted and left the web app unusable. The error log indicates(attached at the end) it is related to libasyncProfiler. Does anyone know what could cause the issue and what we should do to fix it? Thanks!

OS version: CentOS Linux, Version 7(core)
Java Version: openjdk version "1.8.0_275"
JVM Setting: default GC setting with Parallel GC, maximum JVM size is 6GB
APM java agent version: 1.19.0

APM java agent configuration
The APM agent is setup with the Tomcat process with -javaagent flag

/java/1.8.0_275/bin/java -javaagent:elastic-apm-agent-1.19.0.jar -Delastic.apm.config_file=elastic_apm.properties -Delastic.apm.log_file=elastic-apm.log -Delastic.apm.service_name=web-java

What I changed:

transaction_max_spans from 500 to 800
profiling_inferred_spans_min_duration from 25ms to 15 ms

Current elastic_apm.properties:

recording=true
transaction_sample_rate=1.0
max_queue_size=512
server_urls=http://***.com:8200
use_path_as_transaction_name=true
transaction_ignore_urls=/cookie,/liveMessage*,*.js,*.css,*.jpg,*.jpeg,*.png,*.gif,*.svg
application_packages=ixl,quia
capture_headers=false
capture_body=off
metrics_interval=15s
transaction_max_spans=800
span_min_duration=15ms
profiling_inferred_spans_min_duration=15ms
profiling_inferred_spans_sampling_interval=5ms
profiling_inferred_spans_enabled=true
log_file_size=50mb

Error Log
It didn't produce a fatal error log like hs_err_pid.log and only printed some logs in jvm logs:

*** Error in `/opt/java/1.8.0_275/bin/java': double free or corruption (!prev): 0x00007f0dc47550b0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x7f0ecfbce299]
/opt/java/1.8.0_275/jre/lib/amd64/server/libjvm.so(+0x64edab)[0x7f0ecf151dab]
/opt/java/1.8.0_275/jre/lib/amd64/server/libjvm.so(+0x7aa513)[0x7f0ecf2ad513]
/opt/java/1.8.0_275/jre/lib/amd64/server/libjvm.so(+0x79ebc4)[0x7f0ecf2a1bc4]
/home/temp/tomcat_temp/libasyncProfiler-linux-x64-7b0005bb0afc7e225ffe8372878fd01a.so(_ZN8Profiler17getJavaTraceJvmtiEP15_jvmtiFrameInfoP15ASGCT_CallFramei+0x64)[0x7f0dde2b58f4]
/home/temp/tomcat_temp/libasyncProfiler-linux-x64-7b0005bb0afc7e225ffe8372878fd01a.so(_ZN8Profiler17getJavaTraceAsyncEPvP15ASGCT_CallFramei+0x15b)[0x7f0dde2b5c9b]
/home/temp/tomcat_temp/libasyncProfiler-linux-x64-7b0005bb0afc7e225ffe8372878fd01a.so(_ZN8Profiler12recordSampleEPvyiP10_jmethodID11ThreadState+0x20c)[0x7f0dde2b632c]
/lib64/libpthread.so.0(+0xf630)[0x7f0ed0346630]
/lib64/libpthread.so.0(pthread_cond_timedwait+0x132)[0x7f0ed0342de2]
/opt/java/1.8.0_275/jre/lib/amd64/server/libjvm.so(+0x964b5a)[0x7f0ecf467b5a]
/opt/java/1.8.0_275/jre/lib/amd64/server/libjvm.so(+0xb00cb2)[0x7f0ecf603cb2]
[0x7f0eb2fa89ea]
======= Memory map: ========
649000000-722280000 rw-p 00000000 00:00 0 
722280000-743000000 ---p 00000000 00:00 0 
743000000-7ab000000 rw-p 00000000 00:00 0 
7ab000000-7c0000000 ---p 00000000 00:00 0 
7c0000000-7c2060000 rw-p 00000000 00:00 0 
7c2060000-800000000 ---p 00000000 00:00 0 
556d4cdac000-556d4cdad000 r-xp 00000000 08:03 11272234                   /opt/java/1.8.0_275/bin/java
556d4cfac000-556d4cfad000 r--p 00000000 08:03 11272234                   /opt/java/1.8.0_275/bin/java
556d4cfad000-556d4cfae000 rw-p 00001000 08:03 11272234                   /opt/java/1.8.0_275/bin/java
556d4e340000-556d4e361000 rw-p 00000000 00:00 0                          [heap]
7f0c7cf7a000-7f0c7d13a000 rw-p 00000000 00:00 0 
7f0c7d13a000-7f0c7d17a000 ---p 00000000 00:00 0 
7f0c7d17a000-7f0c7d37a000 rw-p 00000000 00:00 0 
7f0c7d37a000-7f0c7d57a000 rw-p 00000000 00:00 0 
7f0c7d57a000-7f0c7d77a000 rw-p 00000000 00:00 0 
7f0c7d77a000-7f0c7d97a000 rw-p 00000000 00:00 0 
7f0c7d97a000-7f0c7db7a000 rw-p 00000000 00:00 0 
7f0c7db7a000-7f0c7dd7a000 rw-p 00000000 00:00 0 
7f0c7dd7a000-7f0c7df7a000 rw-p 00000000 00:00 0 
7f0c7df7a000-7f0c7e17a000 rw-p 00000000 00:00 0 
7f0c7e17a000-7f0c7e37a000 rw-p 00000000 00:00 0 
7f0c7e37a000-7f0c7e3a3000 r-xp 00000000 08:03 2622258                    /usr/lib64/libpng15.so.15.13.0
7f0c7e3a3000-7f0c7e5a3000 ---p 00029000 08:03 2622258                    /usr/lib64/libpng15.so.15.13.0
7f0c7e5a3000-7f0c7e5a4000 r--p 00029000 08:03 2622258                    /usr/lib64/libpng15.so.15.13.0
7f0c7e5a4000-7f0c7e5a5000 rw-p 0002a000 08:03 2622258                    /usr/lib64/libpng15.so.15.13.0
7f0c7e5a5000-7f0c7e5b4000 r-xp 00000000 08:03 2622237                    /usr/lib64/libbz2.so.1.0.6
7f0c7e5b4000-7f0c7e7b3000 ---p 0000f000 08:03 2622237                    /usr/lib64/libbz2.so.1.0.6
7f0c7e7b3000-7f0c7e7b4000 r--p 0000e000 08:03 2622237                    /usr/lib64/libbz2.so.1.0.6
7f0c7e7b4000-7f0c7e7b5000 rw-p 0000f000 08:03 2622237                    /usr/lib64/libbz2.so.1.0.6
7f0c7e7b5000-7f0c7e7ca000 r-xp 00000000 08:03 2622137                    /usr/lib64/libz.so.1.2.7
7f0c7e7ca000-7f0c7e9c9000 ---p 00015000 08:03 2622137                    /usr/lib64/libz.so.1.2.7
7f0c7e9c9000-7f0c7e9ca000 r--p 00014000 08:03 2622137                    /usr/lib64/libz.so.1.2.7
7f0c7e9ca000-7f0c7e9cb000 rw-p 00015000 08:03 2622137                    /usr/lib64/libz.so.1.2.7
7f0c7e9cb000-7f0c7ea82000 r-xp 00000000 08:03 2622235                    /usr/lib64/libfreetype.so.6.14.0
7f0c7ea82000-7f0c7ec82000 ---p 000b7000 08:03 2622235                    /usr/lib64/libfreetype.so.6.14.0
7f0c7ec82000-7f0c7ec89000 r--p 000b7000 08:03 2622235                    /usr/lib64/libfreetype.so.6.14.0
7f0c7ec89000-7f0c7ec8a000 rw-p 000be000 08:03 2622235                    /usr/lib64/libfreetype.so.6.14.0
7f0c7ec8a000-7f0c7ecf2000 r-xp 00000000 08:03 11272346                   /opt/java/1.8.0_275/jre/lib/amd64/libfontmanager.so
7f0c7ecf2000-7f0c7eef2000 ---p 00068000 08:03 11272346                   /opt/java/1.8.0_275/jre/lib/amd64/libfontmanager.so
7f0c7eef2000-7f0c7eef5000 r--p 00068000 08:03 11272346                   /opt/java/1.8.0_275/jre/lib/amd64/libfontmanager.so
7f0c7eef5000-7f0c7eef6000 rw-p 0006b000 08:03 11272346                   /opt/java/1.8.0_275/jre/lib/amd64/libfontmanager.so
7f0c7eef6000-7f0c7f0f7000 rw-p 00000000 00:00 0 
7f0c7f0f7000-7f0c7f2f7000 rw-p 00000000 00:00 0 
.............. 

Thanks for reporting this @bbking,

This is not an already known issue, thus you may have contributed to uncovering a new bug!

From the error you reported it seems that async-profiler is directly involved in the stack trace, thus I've open an issue so it can be properly investigated.

So if I understand correctly:

  • you changed a few settings transaction_max_spans and profiling_inferred_spans_min_duration
  • this has happened multiple times in a test environment
  • you only tested this with version 1.19.0 of the agent
  • the application worked fine until you changed the settings.

What is the frequency of those crashes after changing the settings ?

Changing those parameters only influence the amount of captured spans per transaction and the minimum duration for spans that are created with async-profiler, it should normally only have an influence on the amount of data that is collected and not crash the JVM.

While this error might not be related to that, we know from experience that a sensitive part of async-profiler is stack trace recovery strategies, you can disable them by using the following configuration in elastic_apm.properties:
async_profiler_safe_mode=31.

This async_profiler_safe_mode is a bit field, thus using a value of 31 will disable all of them.
If with that you still experience crashes, we know that it's something else in async-profiler is causing this.

Update after consulting with async-profiler author:

It is related to the way async-profiler works, and setting
async_profiler_safe_mode=16 should allow to work-around this crash cause.

Thanks for looking into it! The frequency of crashes after changing the settings is pretty high and it crashed our test environments more than 50% of the time.

I didn't see async_profiler_safe_mode in the property reference doc. It would be nice to add it in the doc.

Could you explain a bit more about what async_profiler_safe_mode=16 does? What's the difference between mode 16 and 31? Does disabling it have the same effect of turning off stacktrace collection?

Thanks!

This config is not documented on purpose, but we may reconsider that. It is one we typically use only for analysis.

The Async Profiler's area that deals with recovering stack trace frames is known to be sensitive in some systems. It is used as a bit mask with values between 0 and 31, where 0 enables all recovery attempts and 31 disables all five (corresponding 1, 2, 4, 8 and 16). See comment by the async profiler author on the corresponding GitHub, explaining why 16.

No. Using 31 means disabling the recovery of stack frames that failed to be obtained in the original stack trace capture. Using 16 means disabling it when occurs during GC.
This has nothing to do with the stack trace capturing that the agent does for spans, it is only relevant for the Sampling-profiler mechanism.

@bbking we would appreciate if you can update your findings going forward within the GitHub issue referred above, instead of doing it here.
Thanks!

Thanks for your answers! I will comment on the GitHub issue after running the agent with async_profiler_safe_mode=16 for a few days.

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.