SIGSEGV JVM Crash with ZGC Garbage collection

Kibana version: 7.17.5

Elasticsearch version: 7.17.5

Java: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (build 17.0.6+10, mixed mode, sharing)

APM Agent language and version: Java 1.37.0

OS: Alpine Linux v3.17

Running on a Google-managed Kubernetes cluster (GKE), v1.24.12-gke.500. We have APM on about 85 different microservices, but the crash only seems to affect this one microservice and its set of pods.

I can't reproduce it on demand. It happens randomly, but usually within 24 hours of a new pod rolling out. I have core dumps and full error files for 3 crashes so far; they are pretty big (up to 9 GB).

---------------  T H R E A D  ---------------

Current thread (0x00007f74cf32c380):  GCTaskThread "ZWorker#1" [stack: 0x00007f74cf126000,0x00007f74cf226aa8] [id=14]

Stack: [0x00007f74cf126000,0x00007f74cf226aa8],  sp=0x00007f74cf2205e0,  free space=1001k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  []  ZBarrier::mark_barrier_on_oop_slow_path(unsigned long)+0x8b
V  []  void OopOopIterateDispatch<ZMarkBarrierOopClosure<false> >::Table::oop_oop_iterate<InstanceKlass, oopDesc*>(ZMarkBarrierOopClosure<false>*, oopDesc*, Klass*)+0x93
V  []  ZMark::follow_object(oopDesc*, bool)+0xb0
V  []  ZMark::work_without_timeout(ZMarkCache*, ZMarkStripe*, ZMarkThreadLocalStacks*)+0xcf
V  []  ZMark::work(unsigned long)+0x8f
V  []  ZTask::GangTask::work(unsigned int)+0x1c
V  []  GangWorker::loop()+0x5f
V  []
V  []  Thread::call_run()+0xc0
V  []  thread_native_entry(Thread*)+0x131

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000004

Register to memory mapping:

RAX=0x0000000000000064 is an unknown value
RBX=0x00000374d93b52e8 is an unknown value
RCX=0x00007f74e75a1068: <offset 0x0000000001370068> in /opt/java/openjdk/lib/server/ at 0x00007f74e6231000
RDX=0x00007f74e7549140: <offset 0x0000000001318140> in /opt/java/openjdk/lib/server/ at 0x00007f74e6231000
RSP=0x00007f74cf2205e0 points into unknown readable memory: 0x0000040081e7cd58 | 58 cd e7 81 00 04 00 00
RBP=0x00007f74cf220610 points into unknown readable memory: 0x00007f74cf220660 | 60 06 22 cf 74 7f 00 00
RSI=0x000008007edd8c20 is a good oop: java.lang.String 
{0x000008007edd8c20} - klass: 'java/lang/String'
 - string: "co.elastic.apm.exception"
RDI=0x00007f74d93b52e8 is at entry_point+2728 in (nmethod*)0x00007f74d93b4190
R8 =0x00007f74de5924d8 points into unknown readable memory: 0x0000000000000003 | 03 00 00 00 00 00 00 00
R9 =0x00007f74cf220670 points into unknown readable memory: 0x00007f74e7476900 | 00 69 47 e7 74 7f 00 00
R10=0x0 is NULL
R11=0x00007f74cf340880 points into unknown readable memory: 0xffffffff0003af9a | 9a af 03 00 ff ff ff ff
R12=0x00000b74d93b52e8 is an unknown value
R13=0x0 is NULL
R14=0x00007f74cf340440 points into unknown readable memory: 0x00007f74e7476478 | 78 64 47 e7 74 7f 00 00
R15=0x000008002225d6f8 points into unknown readable memory: 0x00007f74d93b52e8 | e8 52 3b d9 74 7f 00 00

This is still an ongoing issue. We have changed:

  • APM Agent to version 1.42.0
  • Base Image from Alpine to Ubuntu Jammy (to test musl versus glibc)

But the problem persists.

Linked Issues:

Can also confirm that updating to Java did not resolve the issue

Increased the thread stack size to 10m; the SIGSEGV still occurs.
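
For anyone trying the same changes, here is a minimal sketch for confirming from inside a pod which collector and JVM flags the process actually started with, so that a change such as the larger stack size can be verified rather than assumed. The class name `JvmConfigCheck` and its helper methods are hypothetical; only the standard `java.lang.management` API is used, and the `-Xss` prefix check assumes the stack size was set with that flag.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

public class JvmConfigCheck {

    // Names of the garbage collectors registered in the running JVM.
    static List<String> collectorNames() {
        return ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());
    }

    // Arguments the JVM was started with, as reported by the runtime MXBean.
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    // True if any argument starts with the given prefix, e.g. "-Xss".
    static boolean hasFlagWithPrefix(List<String> args, String prefix) {
        return args.stream().anyMatch(a -> a.startsWith(prefix));
    }

    public static void main(String[] args) {
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("Collectors:   " + collectorNames());
        System.out.println("JVM args:     " + jvmArguments());
        System.out.println("Stack flag:   " + hasFlagWithPrefix(jvmArguments(), "-Xss"));
    }
}
```

Running this with the same image and options as the affected service shows whether the intended settings reached the JVM. It only reports configuration and does not diagnose the crash itself.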

Removing the APM Java agent eliminates all SIGSEGV errors, and the pods are stable now. The crash is clearly APM-related, but there is no clear solution yet.