JRE crashes after running for hours or days

Hi community,
My single-node Elasticsearch cluster crashes after running for hours or days.
Here is the relevant information, along with the logs.

Elasticsearch version (bin/elasticsearch --version):

7.10.2

Plugins installed:

I installed Elasticsearch following this doc.
No other plugins are installed.

JVM version (java -version):

  • JRE version: OpenJDK Runtime Environment AdoptOpenJDK (15.0.1+9) (build 15.0.1+9)
  • Java VM: OpenJDK 64-Bit Server VM AdoptOpenJDK (15.0.1+9, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)

OS version (uname -a if on a Unix-like system):

Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-42-generic x86_64)
Linux sw-vwordpress01 4.15.0-91-generic #92-Ubuntu SMP Fri Feb 28 11:09:48 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

I am running a single-node Elasticsearch cluster for Elastic Observability.
After running for hours or days, the cluster crashes.

Steps to reproduce:

I use apm-agent-dotnet v1.6.1 to send APM transactions and metrics to APM Server.
APM Server runs on the same machine that hosts the single-node Elasticsearch cluster.
After running for hours or days, the cluster crashes.
It then produces an hs_err_pidXXXXX.log file in the /var/log/elasticsearch directory.

# Problematic frame:
# J 14564 c2 org.apache.lucene.codecs.DocValuesConsumer$SortedNumericDocValuesSub.nextDoc()I (8 bytes) @ 0x00007f650c700669 [0x00007f650c700620+0x0000000000000049]
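When several hs_err files pile up, it can help to compare the problematic frame across them. The following is only a minimal sketch, assuming each report uses the same "# Problematic frame:" header followed by the frame on the next line, as in the excerpt above. The function name `problematic_frame` and the `SAMPLE` text are made up for illustration.

```python
def problematic_frame(text):
    """Return the line that follows '# Problematic frame:' in an hs_err report.

    Returns None when the header is not present.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "# Problematic frame:" and i + 1 < len(lines):
            # Drop the leading '#' and spaces that prefix header lines.
            return lines[i + 1].lstrip("# ").strip()
    return None


# Illustrative input copied from the excerpt in this post (not a full report).
SAMPLE = """\
# Problematic frame:
# J 14564 c2 org.apache.lucene.codecs.DocValuesConsumer$SortedNumericDocValuesSub.nextDoc()I (8 bytes) @ 0x00007f650c700669 [0x00007f650c700620+0x0000000000000049]
"""

frame = problematic_frame(SAMPLE)
print(frame)
```

To run it against real files, read each hs_err_pid*.log under /var/log/elasticsearch and pass its contents to the function. Frames that differ from crash to crash would point away from a single software bug, though that alone proves nothing.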

Provide logs (if relevant):

  • hs_err_pid13402.log

  • hs_err_pid1143.log

  • Running sudo systemctl status elasticsearch shows the following messages:

    root@s-docker01:/var/log/elasticsearch# sudo systemctl status elasticsearch
    ● elasticsearch.service - Elasticsearch
         Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
         Active: failed (Result: signal) since Mon 2021-01-25 23:15:03 CST; 55min ago
           Docs: https://www.elastic.co
        Process: 1143 ExecStart=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet (code=killed, signal=ABRT)
       Main PID: 1143 (code=killed, signal=ABRT)
          Tasks: 0 (limit: 38033)
         Memory: 398.3M
         CGroup: /system.slice/elasticsearch.service
    
    Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: # An error report file with more information is saved as:
    Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: # /var/log/elasticsearch/hs_err_pid1143.log
    Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: #
    Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: # If you would like to submit a bug report, please visit:
    Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: #   https://github.com/AdoptOpenJDK/openjdk-support/issues
    Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: # The crash happened outside the Java Virtual Machine in native code.
    Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: # See problematic frame for where to report the bug.
    Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: #
    Jan 25 23:15:03 s-docker01 systemd[1]: elasticsearch.service: Main process exited, code=killed, status=6/ABRT
    Jan 25 23:15:03 s-docker01 systemd[1]: elasticsearch.service: Failed with result 'signal'.
    

Yesterday my single-node Elasticsearch cluster crashed again, so I dumped the relevant logs.
However, I cannot tell from the dmesg output which hardware component caused it.
Could anyone help me pinpoint the problem?
I also opened an issue here.

Provide logs (if relevant):

  • hs_err_pid1144.log

  • dmesg_2021-01-27.log

  • Running sudo systemctl status elasticsearch shows the following messages:

    ● elasticsearch.service - Elasticsearch
         Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
         Active: failed (Result: signal) since Wed 2021-01-27 09:14:35 CST; 24min ago
           Docs: https://www.elastic.co
        Process: 1144 ExecStart=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet (code=killed, signal=ABRT)
       Main PID: 1144 (code=killed, signal=ABRT)
          Tasks: 0 (limit: 38033)
         Memory: 4.0G
         CGroup: /system.slice/elasticsearch.service
    
    Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]:  scopes data    [0x00007ffa1cbd3e88,0x00007ffa1cbd3e98] = 16
    Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]:  scopes pcs     [0x00007ffa1cbd3e98,0x00007ffa1cbd3ec8] = 48
    Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]:  dependencies   [0x00007ffa1cbd3ec8,0x00007ffa1cbd3ed0] = 8
    Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]:  handler table  [0x00007ffa1cbd3ed0,0x00007ffa1cbd3ee8] = 24
    Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]: #
    Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]: # If you would like to submit a bug report, please visit:
    Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]: #   https://github.com/AdoptOpenJDK/openjdk-support/issues
    Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]: #
    Jan 27 09:14:35 s-docker01 systemd[1]: elasticsearch.service: Main process exited, code=killed, status=6/ABRT
    Jan 27 09:14:35 s-docker01 systemd[1]: elasticsearch.service: Failed with result 'signal'.
    

Welcome to our community! :smiley:

What's in the Elasticsearch log, usually under /var/log/elasticsearch/?

Hi, mark

Here are logs under /var/log/elasticsearch/.

Can you please post them if they are not too big, or use gist/pastebin/etc.?

Hi, mark
Here is the gist.
Please click the link to see all 18 logs.

[Mon Jan 25 22:05:40 2021] node[1175]: segfault at 1 ip 0000000000000001 sp 00007ffee4ae5248 error 14 in node[400000+204e000]
[Mon Jan 25 22:05:40 2021] Code: Bad RIP value.
[Tue Jan 26 20:48:05 2021] traps: node[5038] trap invalid opcode ip:17aad89 sp:7fc92b992898 error:0 in node[400000+204e000]

This looks like bad hardware, although more likely bad RAM or CPU rather than storage. Does this reproduce on a different machine?
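The dmesg lines quoted above show crashes in a process other than the Elasticsearch JVM, which is part of why hardware is suspected. As a rough aid, here is a minimal sketch that tallies segfault and trap entries per process name from dmesg output. It assumes the two kernel message shapes seen above; the name `crashes_by_process` and the regular expressions are illustrative and may miss other formats.

```python
import re
from collections import defaultdict

# Matches e.g. "] node[1175]: segfault at 1 ip ..."
SEGFAULT = re.compile(r"\]\s+(?P<proc>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at")
# Matches e.g. "] traps: node[5038] trap invalid opcode ip:..."
TRAP = re.compile(r"\]\s+traps: (?P<proc>[^\s\[]+)\[(?P<pid>\d+)\] trap (?P<kind>.+?) ip:")


def crashes_by_process(lines):
    """Group kernel-reported crashes by process name, in order of appearance."""
    found = defaultdict(list)
    for line in lines:
        match = SEGFAULT.search(line)
        if match:
            found[match.group("proc")].append("segfault")
            continue
        match = TRAP.search(line)
        if match:
            found[match.group("proc")].append(match.group("kind"))
    return dict(found)


# The dmesg lines quoted in this thread.
SAMPLE = """\
[Mon Jan 25 22:05:40 2021] node[1175]: segfault at 1 ip 0000000000000001 sp 00007ffee4ae5248 error 14 in node[400000+204e000]
[Mon Jan 25 22:05:40 2021] Code: Bad RIP value.
[Tue Jan 26 20:48:05 2021] traps: node[5038] trap invalid opcode ip:17aad89 sp:7fc92b992898 error:0 in node[400000+204e000]
"""

result = crashes_by_process(SAMPLE.splitlines())
print(result)
```

Unrelated processes crashing in different ways on the same machine is a hint rather than proof of a hardware fault. Reproducing on another machine, as asked above, is the more reliable check.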

I have moved my single-node Elasticsearch cluster to a different machine, and it has been running healthily for about 4 days.
There are no hardware error messages in the output of dmesg.
I will keep watching and report back here.

Thanks for your help.

Hi community,
It has been about 3 weeks since my last report.
My Elasticsearch cluster has stayed healthy since then.
We can confirm that the problem was caused by bad hardware.
Thank you all.
:wink:
