SIGSEGV JVM Crash with ZGC Garbage collection

Kim_Attree · June 30, 2023, 6:18am

Kibana version: 7.17.5

Elasticsearch version: 7.17.5

Java: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (build 17.0.6+10, mixed mode, sharing)

APM Agent language and version: Java 1.37.0

OS: Alpine Linux v3.17

Running on a Google-managed Kubernetes cluster (GKE), v1.24.12-gke.500. We have APM on about 85 different microservices, but the crash only seems to affect this one microservice/set of pods.

I can't reproduce it. It happens randomly, but usually within 24 hours of a new pod rolling out. I have core dumps and full error files for 3 crashes so far; they are pretty big (up to 9 GB).

---------------  T H R E A D  ---------------

Current thread (0x00007f74cf32c380):  GCTaskThread "ZWorker#1" [stack: 0x00007f74cf126000,0x00007f74cf226aa8] [id=14]

Stack: [0x00007f74cf126000,0x00007f74cf226aa8],  sp=0x00007f74cf2205e0,  free space=1001k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xf3dffb]  ZBarrier::mark_barrier_on_oop_slow_path(unsigned long)+0x8b
V  [libjvm.so+0xf58343]  void OopOopIterateDispatch<ZMarkBarrierOopClosure<false> >::Table::oop_oop_iterate<InstanceKlass, oopDesc*>(ZMarkBarrierOopClosure<false>*, oopDesc*, Klass*)+0x93
V  [libjvm.so+0xf55890]  ZMark::follow_object(oopDesc*, bool)+0xb0
V  [libjvm.so+0xf564cf]  ZMark::work_without_timeout(ZMarkCache*, ZMarkStripe*, ZMarkThreadLocalStacks*)+0xcf
V  [libjvm.so+0xf572bf]  ZMark::work(unsigned long)+0x8f
V  [libjvm.so+0xf7d41c]  ZTask::GangTask::work(unsigned int)+0x1c
V  [libjvm.so+0xf3846f]  GangWorker::loop()+0x5f
V  [libjvm.so+0xf384cf]
V  [libjvm.so+0xe86de0]  Thread::call_run()+0xc0
V  [libjvm.so+0xc36fe1]  thread_native_entry(Thread*)+0x131


siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000004

Register to memory mapping:

RAX=0x0000000000000064 is an unknown value
RBX=0x00000374d93b52e8 is an unknown value
RCX=0x00007f74e75a1068: <offset 0x0000000001370068> in /opt/java/openjdk/lib/server/libjvm.so at 0x00007f74e6231000
RDX=0x00007f74e7549140: <offset 0x0000000001318140> in /opt/java/openjdk/lib/server/libjvm.so at 0x00007f74e6231000
RSP=0x00007f74cf2205e0 points into unknown readable memory: 0x0000040081e7cd58 | 58 cd e7 81 00 04 00 00
RBP=0x00007f74cf220610 points into unknown readable memory: 0x00007f74cf220660 | 60 06 22 cf 74 7f 00 00
RSI=0x000008007edd8c20 is a good oop: java.lang.String 
{0x000008007edd8c20} - klass: 'java/lang/String'
 - string: "co.elastic.apm.exception"
RDI=0x00007f74d93b52e8 is at entry_point+2728 in (nmethod*)0x00007f74d93b4190
R8 =0x00007f74de5924d8 points into unknown readable memory: 0x0000000000000003 | 03 00 00 00 00 00 00 00
R9 =0x00007f74cf220670 points into unknown readable memory: 0x00007f74e7476900 | 00 69 47 e7 74 7f 00 00
R10=0x0 is NULL
R11=0x00007f74cf340880 points into unknown readable memory: 0xffffffff0003af9a | 9a af 03 00 ff ff ff ff
R12=0x00000b74d93b52e8 is an unknown value
R13=0x0 is NULL
R14=0x00007f74cf340440 points into unknown readable memory: 0x00007f74e7476478 | 78 64 47 e7 74 7f 00 00
R15=0x000008002225d6f8 points into unknown readable memory: 0x00007f74d93b52e8 | e8 52 3b d9 74 7f 00 00
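
One way to read the register dump above is to compare RDI, RBX and R12 bit by bit. The sketch below is only an illustration, and it rests on an assumption that is not stated anywhere in the report: that this JVM uses a ZGC pointer layout with 42 offset bits and the marked0/marked1 metadata bits directly above them (which would match the `0x0000040...` and `0x0000080...` values in the dump). The class and method names are hypothetical.

```java
public class ZAddressCheck {
    // ASSUMPTION: 42 offset bits, inferred from the values in the dump,
    // not confirmed by the crash report itself.
    static final int OFFSET_BITS = 42;
    static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;
    // ASSUMPTION: metadata bits sit directly above the offset bits.
    static final long MARKED0 = 1L << OFFSET_BITS;
    static final long MARKED1 = 1L << (OFFSET_BITS + 1);

    // Strip everything above the offset bits.
    static long offsetOf(long address) {
        return address & OFFSET_MASK;
    }

    // Re-colour an address with the marked1 bit.
    static long withMarked1(long address) {
        return offsetOf(address) | MARKED1;
    }

    public static void main(String[] args) {
        // Values copied from the "Register to memory mapping" section.
        long rdi = 0x00007f74d93b52e8L; // reported as inside an nmethod
        long rbx = 0x00000374d93b52e8L;
        long r12 = 0x00000b74d93b52e8L;
        long rsi = 0x000008007edd8c20L; // reported as a good oop

        System.out.printf("offset(RDI)  = 0x%016x, equals RBX: %b%n",
                offsetOf(rdi), offsetOf(rdi) == rbx);
        System.out.printf("marked1(RDI) = 0x%016x, equals R12: %b%n",
                withMarked1(rdi), withMarked1(rdi) == r12);
        System.out.printf("RSI has marked1 bit set: %b%n", (rsi & MARKED1) != 0);
    }
}
```

Under that assumption, RBX and R12 are what you get when the value in RDI is treated as a heap reference and masked or re-coloured. Since RDI is reported as an address inside an nmethod, and the memory at R15 holds the same value, this would be consistent with a heap slot containing a code address instead of an object reference. That is a possible interpretation, not a confirmed diagnosis.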

Kim_Attree · September 29, 2023, 3:32am

This is still an ongoing issue. So far we have changed:

APM Agent to version 1.42.0
Base Image from Alpine to Ubuntu Jammy (to test musl versus glibc)

But the problem persists.

Kim_Attree · September 29, 2023, 3:33am

Linked Issues:

github.com/adoptium/adoptium-support

SIGSEGV on ZBarrier::mark_barrier_on_oop_slow_path

opened 01:46PM - 19 Apr 23 UTC

kimattree

bug Waiting on OP stale

### Please provide a brief summary of the bug

I have experienced this crash across 2 x microservices in the same cluster, randomly there is a crash with the reported error:

# V [libjvm.so+0xf32adb] ZBarrier::mark_barrier_on_oop_slow_path(unsigned long)+0x8b

### Please provide steps to reproduce where possible

Unable to reproduce manually, has occurred randomly over the past week 4 times across 2 services.

### Expected Results

no crash and normal operation

### Actual Results

java crashes, kubernetes liveness probe fails and pod restarts.

### What Java Version are you using?

JRE version: OpenJDK Runtime Environment Temurin-17.0.3+7 (17.0.3+7) (build 17.0.3+7)

### What is your operating system and platform?

Microservice pod running from base image "amd64/eclipse-temurin:17.0.3_7-jdk-alpine"
Running in a GKE Kubernetes 1.23 cluster

### How did you install Java?

Used the amd64/eclipse-temurin:17.0.3_7-jdk-alpine image from docker hub

### Did it work before?

```Shell
we have used this version for 6+ months - first recorded instance we've noticed but cannot discount previous events occurring - we have only 30 days of logs
```

### Did you test with the latest update version?

```Shell
no
```

### Did you test with other Java versions?

```Shell
no
```

### Relevant log output

```Shell
Apr 14, 2023 @ 12:18:21.003 # Problematic frame:
Apr 14, 2023 @ 12:18:21.003 # SIGSEGV (0xb) at pc=0x00007f6ef61b2adb, pid=7, tid=13
Apr 14, 2023 @ 12:18:21.003 # JRE version: OpenJDK Runtime Environment Temurin-17.0.3+7 (17.0.3+7) (build 17.0.3+7)
Apr 14, 2023 @ 12:18:21.003 #
Apr 14, 2023 @ 12:18:21.003 # A fatal error has been detected by the Java Runtime Environment:
Apr 14, 2023 @ 12:18:21.003 # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.3+7 (17.0.3+7, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
Apr 14, 2023 @ 12:18:21.007 # Core dump will be written. Default location: /core.%e.7.%t
Apr 14, 2023 @ 12:18:21.007 # //hs_err_pid7.log
Apr 14, 2023 @ 12:18:21.007 #
Apr 14, 2023 @ 12:18:21.007 # V [libjvm.so+0xf32adb] ZBarrier::mark_barrier_on_oop_slow_path(unsigned long)+0x8b
Apr 14, 2023 @ 12:18:21.007 # An error report file with more information is saved as:
Apr 14, 2023 @ 12:18:21.009 metadata [0x00007f6ee8d65b68,0x00007f6ee8d66138] = 1488
Apr 14, 2023 @ 12:18:21.009 stub code [0x00007f6ee8d65a00,0x00007f6ee8d65b08] = 264
Apr 14, 2023 @ 12:18:21.009 scopes pcs [0x00007f6ee8d69728,0x00007f6ee8d6da38] = 17168
Apr 14, 2023 @ 12:18:21.009 dependencies [0x00007f6ee8d6da38,0x00007f6ee8d6dc00] = 456
Apr 14, 2023 @ 12:18:21.009 relocation [0x00007f6ee8d5ef70,0x00007f6ee8d5f4a0] = 1328
Apr 14, 2023 @ 12:18:21.009 total in heap [0x00007f6ee8d5ee10,0x00007f6ee8d6e2c0] = 62640
Apr 14, 2023 @ 12:18:21.009 scopes data [0x00007f6ee8d66138,0x00007f6ee8d69728] = 13808
Apr 14, 2023 @ 12:18:21.009 handler table [0x00007f6ee8d6dc00,0x00007f6ee8d6e0f8] = 1272
Apr 14, 2023 @ 12:18:21.009 nul chk table [0x00007f6ee8d6e0f8,0x00007f6ee8d6e2c0] = 456
Apr 14, 2023 @ 12:18:21.009 Compiled method (c2) 6896098 33692 ! 4 org.springframework.web.servlet.DispatcherServlet::doDispatch (543 bytes)
Apr 14, 2023 @ 12:18:21.009 main code [0x00007f6ee8d5f4a0,0x00007f6ee8d65a00] = 25952
Apr 14, 2023 @ 12:18:21.009 oops [0x00007f6ee8d65b08,0x00007f6ee8d65b68] = 96
Apr 14, 2023 @ 12:18:21.012 oops [0x00007f6ee8d65b08,0x00007f6ee8d65b68] = 96
Apr 14, 2023 @ 12:18:21.012 nul chk table [0x00007f6ee8d6e0f8,0x00007f6ee8d6e2c0] = 456
Apr 14, 2023 @ 12:18:21.012 relocation [0x00007f6ee8d5ef70,0x00007f6ee8d5f4a0] = 1328
Apr 14, 2023 @ 12:18:21.012 dependencies [0x00007f6ee8d6da38,0x00007f6ee8d6dc00] = 456
Apr 14, 2023 @ 12:18:21.012 Compiled method (c2) 6896101 33692 ! 4 org.springframework.web.servlet.DispatcherServlet::doDispatch (543 bytes)
Apr 14, 2023 @ 12:18:21.012 total in heap [0x00007f6ee8d5ee10,0x00007f6ee8d6e2c0] = 62640
Apr 14, 2023 @ 12:18:21.012 handler table [0x00007f6ee8d6dc00,0x00007f6ee8d6e0f8] = 1272
Apr 14, 2023 @ 12:18:21.012 stub code [0x00007f6ee8d65a00,0x00007f6ee8d65b08] = 264
Apr 14, 2023 @ 12:18:21.012 metadata [0x00007f6ee8d65b68,0x00007f6ee8d66138] = 1488
Apr 14, 2023 @ 12:18:21.012 scopes data [0x00007f6ee8d66138,0x00007f6ee8d69728] = 13808
Apr 14, 2023 @ 12:18:21.012 main code [0x00007f6ee8d5f4a0,0x00007f6ee8d65a00] = 25952
Apr 14, 2023 @ 12:18:21.012 scopes pcs [0x00007f6ee8d69728,0x00007f6ee8d6da38] = 17168
Apr 14, 2023 @ 12:18:21.042 # If you would like to submit a bug report, please visit:
Apr 14, 2023 @ 12:18:21.042 # https://github.com/adoptium/adoptium-support/issues
Apr 14, 2023 @ 12:18:21.042 #
```

github.com/adoptium/adoptium-support

SIGSEV on ObjectSynchronizer::FastHashCode

opened 01:50PM - 29 Jun 23 UTC

svenrienstra

bug Waiting on OP stale

### Please provide a brief summary of the bug

We've now multiple times experienced a JVM crash in our cluster. We haven't been able to establish a pattern of when this happens or what triggers this. We get the following crash:

```
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f269de9b474, pid=1, tid=600
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.7+7 (17.0.7+7) (build 17.0.7+7)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.7+7 (17.0.7+7, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xe34474] ObjectSynchronizer::FastHashCode(Thread*, oopDesc*)+0x184
```

[hs_err_pid1.log](https://github.com/adoptium/adoptium-support/files/11906396/hs_err_pid1.log)

### Please provide steps to reproduce where possible

I've not been able to reproduce this manually.

### Expected Results

A graceful exception, not the whole JVM to crash.

### Actual Results

JVM crash

### What Java Version are you using?

eclipse-temurin:17-alpine docker image

### What is your operating system and platform?

eclipse-temurin:17-alpine running on Kubernetes cluster

### How did you install Java?

_No response_

### Did it work before?

_No response_

### Did you test with the latest update version?

_No response_

### Did you test with other Java versions?

_No response_

### Relevant log output

```Shell
Current thread (0x00007f2659a640a0): JavaThread "http-nio-8880-exec-1" daemon [_thread_in_vm, id=600, stack(0x00007f263dfe8000,0x00007f263e0e8aa8)]

Stack: [0x00007f263dfe8000,0x00007f263e0e8aa8], sp=0x00007f263e0e4e10, free space=1011k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xe34474] ObjectSynchronizer::FastHashCode(Thread*, oopDesc*)+0x184
V [libjvm.so+0x90432e] JVM_IHashCode+0x9e
J 32297 java.lang.System.identityHashCode(Ljava/lang/Object;)I java.base@17.0.7 (0 bytes) @ 0x00007f268f78859a [0x00007f268f7884a0+0x00000000000000fa]
J 63004 c2 com.blogspot.mydailyjava.weaklockfree.AbstractWeakConcurrentMap.containsKey(Ljava/lang/Object;)Z (46 bytes) @ 0x00007f268f203b14 [0x00007f268f203520+0x00000000000005f4]
J 117851 c2 co.elastic.apm.agent.loginstr.reformatting.AbstractEcsReformattingHelper.onAppendEnter(Ljava/lang/Object;)Z (124 bytes) @ 0x00007f26945aa698 [0x00007f26945aa580+0x0000000000000118]
j co.elastic.apm.agent.jul.reformatting.JulConsoleHandlerPublishAdvice.initializeReformatting(Ljava/util/logging/ConsoleHandler;)Z+4
J 121708 c2 java.util.logging.Logger.log(Ljava/util/logging/LogRecord;)V java.logging@17.0.7 (153 bytes) @ 0x00007f2694acacdc [0x00007f2694acab40+0x000000000000019c]
J 96765 c1 java.util.logging.Logger.doLog(Ljava/util/logging/LogRecord;)V java.logging@17.0.7 (50 bytes) @ 0x00007f268a2ce814 [0x00007f268a2ce560+0x00000000000002b4]
J 64547 c2 java.util.logging.Logger.log(Ljava/util/logging/Level;Ljava/lang/String;)V java.logging@17.0.7 (25 bytes) @ 0x00007f268fce398c [0x00007f268fce3660+0x000000000000032c]

Register to memory mapping:

RAX=0x00000000069da536 is an unknown value
RBX=0x000000069da53680 is pointing into object: [Ljava.util.logging.Handler;
{0x000000069da53670} - klass: 'java/util/logging/Handler'[]
 - length: 1
RCX=3313425028 is a compressed pointer to object: java.lang.ThreadLocal$ThreadLocalMap
{0x000000062bf6d420} - klass: 'java/lang/ThreadLocal$ThreadLocalMap'
 - ---- fields (total size 3 words):
 - private 'size' 'I' @12 76 (4c)
 - private 'threshold' 'I' @16 170 (aa)
 - private 'table' '[Ljava/lang/ThreadLocal$ThreadLocalMap$Entry;' @20 a 'java/lang/ThreadLocal$ThreadLocalMap$Entry'[256] {0x00000006333e3e70} (c667c7ce)
RDX=0x0000000000000006 is an unknown value
RSP=0x00007f263e0e4e10 is pointing into the stack for thread: 0x00007f2659a640a0
RBP=0x00007f263e0e4e60 is pointing into the stack for thread: 0x00007f2659a640a0
RSI=0x63697461c1c33484 is an unknown value
RDI=0x00007f2659a640a0 is a thread
R8 =0x00007f2683eeeb6a points into unknown readable memory: 00 ff ff ff ff 00
R9 =3269675602 is a compressed pointer to object: co.elastic.apm.agent.weakconcurrent.CachedLookupKey$1
{0x00000006171a5290} - klass: 'co/elastic/apm/agent/weakconcurrent/CachedLookupKey$1'
 - ---- fields (total size 2 words):
 - private final 'threadLocalHashCode' 'I' @12 -1996036213 (8906e78b)
R10=0x00007f268f788527 is at entry_point+135 in (nmethod*)0x00007f268f788310
R11=0x0000000000000006 is an unknown value
R12=0x00000000d52903e6 is an unknown value
R13=0xffffff80000000ff is an unknown value
R14=0x000000069da53680 is pointing into object: [Ljava.util.logging.Handler;
{0x000000069da53670} - klass: 'java/util/logging/Handler'[]
 - length: 1
R15=0x00007f2659a640a0 is a thread
```

Kim_Attree · October 10, 2023, 6:32am

I can also confirm that updating to Java 17.0.8.1_1 did not resolve the issue.

Kim_Attree · October 12, 2023, 6:20am

Increased the stack size to 10m; the SIGSEGV still occurs.

Removing the APM Java agent eliminates all SIGSEGV errors, and the pods are now stable. The crash is clearly APM related, but there is no clear solution yet.
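
When comparing pods with and without the agent, it can help to confirm from inside the JVM what is actually loaded. The following is a minimal sketch, not part of the original setup: the class name `AgentCheck` and the helper `javaAgentFlags` are hypothetical. It lists any `-javaagent:` startup flags and the names of the active garbage collector beans using the standard `java.lang.management` API.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

public class AgentCheck {
    // Keep only the JVM arguments that load a Java agent at startup.
    static List<String> javaAgentFlags(List<String> jvmArgs) {
        return jvmArgs.stream()
                .filter(arg -> arg.startsWith("-javaagent:"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Arguments the JVM was started with (excludes main-method arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        List<String> agents = javaAgentFlags(jvmArgs);
        if (agents.isEmpty()) {
            System.out.println("No -javaagent flag found in the JVM arguments");
        } else {
            agents.forEach(a -> System.out.println("Agent flag: " + a));
        }

        // Print the collector bean names so the GC in use can be confirmed.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("GC bean: " + gc.getName());
        }
    }
}
```

Note that this only sees agents passed on the command line; an agent attached at runtime would not show up in the argument list.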
