Elasticsearch 6.8 in Docker crashing with SIGILL after restart - corrupt state?

We recently upgraded our cluster to 6.8 and, in the process, started running Elasticsearch in Docker containers, one per host, using the elasticsearch:6.8.0 image from Docker Hub. Testing and deployment went fine, but now that we're doing system updates we find that any time we stop and restart a container, it crashes on startup (while reading index state) with SIGILL:

elasticsearch_1  | [2019-09-20T14:56:03,000][INFO ][o.e.x.m.p.l.CppLogMessageHandler]     [esworker09] [controller/113] [Main.cc@109] controller (64 bit): Version 6.8.0 (Build e6cf25e2acc5ec) Copyright (c) 2019 Elasticsearch BV
elasticsearch_1  | #
elasticsearch_1  | # A fatal error has been detected by the Java Runtime Environment:
elasticsearch_1  | #
elasticsearch_1  | #  SIGILL (0x4) at pc=0x00007fde7c97c8c8, pid=1, tid=88
elasticsearch_1  | #
elasticsearch_1  | # JRE version: OpenJDK Runtime Environment (12.0.1+12) (build 12.0.1+12)
elasticsearch_1  | # Java VM: OpenJDK 64-Bit Server VM (12.0.1+12, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
elasticsearch_1  | # Problematic frame:
elasticsearch_1  | # J 4984 c2 com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.findName([II)Ljava/lang/String; (198 bytes) @ 0x00007fde7c97c8c8 [0x00007fde7c97c7a0+0x0000000000000128]
elasticsearch_1  | #

The only consistent fix we've found so far is clearing the data directories, which are bind-mounted into the container from the host. It doesn't matter whether we restart the old container or create a new one. Further complicating matters, one node was left in a crash loop for about 12 hours and then started working again with no intervention. Upgrading to 6.8.3 with OpenJDK 12.0.2 did not help.
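For reference, each host runs a single container from a Compose file roughly like the sketch below; the cluster name, unicast hosts, heap sizes, and host path are placeholders rather than our exact settings:

version: '2.2'
services:
  elasticsearch:
    image: elasticsearch:6.8.0
    environment:
      - node.name=esworker09
      - cluster.name=our-cluster
      - "discovery.zen.ping.unicast.hosts=es01,es02,es03"
      - "ES_JAVA_OPTS=-Xms8g -Xmx8g"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      # wiping this bind-mounted directory is the only thing that reliably
      # clears the SIGILL crash loop
      - /srv/elasticsearch/data:/usr/share/elasticsearch/data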

Any suggestions on how to further troubleshoot this?

Follow-up in case anyone else runs across this: with the help of some patient engineers at Elastic{ON} yesterday, we determined that the JVM was sometimes misidentifying the available instruction sets (specifically SSSE3) on our older Opterons. Manually limiting it to SSE2 (-XX:UseSSE=2) seems to have solved it.
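For anyone hitting the same thing, this is roughly how we apply the workaround. ES_JAVA_OPTS is appended by the image to the options from jvm.options, so the flag can be passed as an environment variable; the heap sizes, host path, and image tag below are placeholders, and the last line just checks what SIMD extensions the host CPU actually advertises for comparison:

# Pass the workaround flag via ES_JAVA_OPTS (other cluster settings omitted)
docker run -d \
  -e "ES_JAVA_OPTS=-Xms8g -Xmx8g -XX:UseSSE=2" \
  -v /srv/elasticsearch/data:/usr/share/elasticsearch/data \
  elasticsearch:6.8.3

# Sanity check: what SSE/SSSE3 levels does the host CPU report?
grep -m1 -o -E 'ssse3|sse4_1|sse4_2' /proc/cpuinfo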

