SIGSEGV JVM Crash

When using the Elastic APM Java agent, the JVM crashes unexpectedly.
I tried different AWS EC2 instance types; it happens on both AMD and Graviton CPUs.
Happens with JVMs with small (4GB) and large (60GB) heaps.

Kibana version: 7.6.2

Elasticsearch version: 7.6.2

Java: OpenJDK Runtime Environment Corretto-17.0.5.8.1 64-bit (x86 and Aarch)

APM Agent language and version: Java 1.35.0

OS: Amazon Linux 2 (x86 and Aarch)

I've attached 3 JVM crash files below as replies. I have about 15 more.

Unfortunately, I cannot reproduce the issue on demand; to me it appears to happen randomly.

It looks a bit like "Segmentation fault when attaching apm-agent-java to adopt jdk 11" (Issue #864, elastic/apm-agent-java on GitHub), but with Java 17.

Thanks for the error reports. Do you have any correlation between stacks (or top of stack) and CPU architecture? Do you have any more info on when it tends to happen (how far into the application run) and how often it happens (e.g. 10% of runs)? Thanks.

If you can attach all the crash logs, we can do that analysis. This looks like it's not going to be easy to figure out.
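For anyone wanting to do a first pass over a pile of hs_err files locally, here is a rough sketch that groups them by the line following `# Problematic frame:`. It assumes the usual HotSpot crash log layout and file names starting with `hs_err`; the class and method names are made up for illustration, and this is not the analysis the Elastic team ran.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper: counts hs_err crash logs per "problematic frame".
public class HsErrSummary {

    // Returns the line that follows "# Problematic frame:", without the leading "# ".
    static Optional<String> problematicFrame(List<String> lines) {
        for (int i = 0; i + 1 < lines.size(); i++) {
            if (lines.get(i).startsWith("# Problematic frame:")) {
                return Optional.of(lines.get(i + 1).replaceFirst("^#\\s*", "").trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Directory holding the crash logs; defaults to the current directory.
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        Map<String, Integer> counts = new TreeMap<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (!p.getFileName().toString().startsWith("hs_err")) {
                    continue; // assumption: crash logs keep their default hs_err prefix
                }
                // ISO-8859-1 never fails on unexpected bytes in the log.
                List<String> lines = Files.readAllLines(p, StandardCharsets.ISO_8859_1);
                String frame = problematicFrame(lines).orElse("(no problematic frame found)");
                counts.merge(frame, 1, Integer::sum);
            }
        }
        counts.forEach((frame, n) -> System.out.println(n + "\t" + frame));
    }
}
```

Running it as `java HsErrSummary.java /path/to/crash/logs` prints one line per distinct frame with a count, which makes it easier to see whether the crashes cluster in one place.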

Thank you for having a look.

It usually happens between 30 minutes and 3 hours after JVM startup, and it appears to happen while the app is in use: during the daytime (not at night), and more often when more users are using the app. It usually happens during an AWS S3 upload or download using the software.amazon.awssdk s3 library (version 2.19.2).

It happens on both Aarch (Graviton 2) and x86 64-bit (AMD Epyc) CPU architectures.

This issue did not happen with Java APM agent version 1.20.0 on Java 11.0.16.1.

Crash logs: hs_err - Google Drive

Have you also run without the Elastic agent on the config that crashes (Corretto 17) and found it stable?

Yes, that is correct. Removing the

-javaagent:/home/username/elastic-apm-agent-1.35.0.jar

JVM startup parameter fixed the issue, and we are no longer experiencing any crashes.

Thanks. The larger set of crash logs has a couple of crashes that show intercepts of the Elastic methods happening from AWS interceptors. Are you explicitly adding interceptors, or is that something Amazon does automatically (for example, for XRay tracking of errors)?

We are not explicitly adding interceptors.

And no dependencies on aws-xray-* ?

It might be worth noting that we are running on the Payara 6.2022.2 app server.
As far as I can see, it doesn't include the aws-xray dependency either.

Could you try with this 1.35 snapshot, please? It is just the latest build with the throwables no longer captured for those paths (PR). We suspect this is an interplay with how Corretto specializes Throwable handling on AWS machines. If this stops the crashes here, we'll look at how best to apply it more generically.

Sure, will give it a go and report back tomorrow.

It works

Thanks! We'll produce a more complete PR and include that in an upcoming release.

Thank you. I appreciate it.

Just for completeness, please test the updated snapshot, which looks specifically for a Corretto JVM and avoids the capture only in that case.
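As an illustration of what "looking for a Corretto JVM" could amount to, here is a minimal sketch based on standard system properties. It is not the agent's actual code; the class and method names are hypothetical, and the assumption is that Corretto builds report "Corretto" in `java.vendor.version` (as in the "Corretto-17.0.5.8.1" string above) or an Amazon name in `java.vendor`.

```java
// Hypothetical sketch of a vendor check; not taken from the Elastic agent.
public class CorrettoCheck {

    // Pure function so it can be checked without running on a Corretto JVM.
    static boolean looksLikeCorretto(String vendor, String vendorVersion) {
        // Assumption: Corretto puts its name into java.vendor.version.
        boolean versionMatches = vendorVersion != null && vendorVersion.contains("Corretto");
        // Assumption: Corretto reports an Amazon vendor string in java.vendor.
        boolean vendorMatches = vendor != null && vendor.startsWith("Amazon");
        return versionMatches || vendorMatches;
    }

    static boolean runningOnCorretto() {
        return looksLikeCorretto(
                System.getProperty("java.vendor"),
                System.getProperty("java.vendor.version"));
    }

    public static void main(String[] args) {
        System.out.println("java.vendor         = " + System.getProperty("java.vendor"));
        System.out.println("java.vendor.version = " + System.getProperty("java.vendor.version"));
        System.out.println("Looks like Corretto: " + runningOnCorretto());
    }
}
```

Running `java CorrettoCheck.java` on the affected hosts shows what the JVM actually reports, which is a quick way to confirm whether a vendor-specific workaround would apply there.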

Will do, apologies for the delay. It will be tested tomorrow.

Actually, we're about to release the new version with the workaround, so please just test that when it's out rather than the snapshot.


v1.36.0 works perfectly, thank you.
