JRE crash after running for days or hours

billhong-just · January 28, 2021, 2:08am

Hi community,
My single-node Elasticsearch cluster crashes after running for hours or days.
Here are the relevant information and logs.

Elasticsearch version (bin/elasticsearch --version):

7.10.2

Plugins installed:

I installed Elasticsearch following this doc.
No other plugins are installed.

JVM version (java -version):

JRE version: OpenJDK Runtime Environment AdoptOpenJDK (15.0.1+9) (build 15.0.1+9)
Java VM: OpenJDK 64-Bit Server VM AdoptOpenJDK (15.0.1+9, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)

OS version (uname -a if on a Unix-like system):

Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-42-generic x86_64)
Linux sw-vwordpress01 4.15.0-91-generic #92-Ubuntu SMP Fri Feb 28 11:09:48 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

I am running a single-node Elasticsearch cluster for Elastic Observability.
After running for hours or days, the cluster crashes.

Steps to reproduce:

I use apm-agent-dotnet v1.6.1 to send APM transactions and metrics to APM Server.
APM Server runs on the same server that hosts the single-node Elasticsearch cluster.
After running for hours or days, the cluster crashes.
It then produces an hs_err_pidXXXXX.log file in the /var/log/elasticsearch directory.

# Problematic frame:
# J 14564 c2 org.apache.lucene.codecs.DocValuesConsumer$SortedNumericDocValuesSub.nextDoc()I (8 bytes) @ 0x00007f650c700669 [0x00007f650c700620+0x0000000000000049]
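
For anyone triaging a similar report, here is a minimal sketch of splitting that frame line into its parts. In HotSpot error reports the leading letter marks the frame type (the report's own legend lists J as compiled Java code), and c2 names the JIT compiler that produced the code. The function name `parse_compiled_frame` and the regular expression are illustrative assumptions; they handle only the line shape quoted above, not every frame format an hs_err file can contain.

```python
import re

# Matches a compiled-Java ("J") frame of the shape quoted above:
#   J <compile id> <compiler> <method signature> (<n> bytes) @ <address> ...
FRAME_RE = re.compile(
    r"^#?\s*J\s+(?P<compile_id>\d+)\s+(?P<compiler>c1|c2)\s+"
    r"(?P<method>\S+)\s+\((?P<size>\d+) bytes\)"
)


def parse_compiled_frame(line):
    """Return the fields of a compiled-Java frame line, or None if it does not match."""
    m = FRAME_RE.match(line.strip())
    return m.groupdict() if m else None


line = (
    "# J 14564 c2 org.apache.lucene.codecs.DocValuesConsumer"
    "$SortedNumericDocValuesSub.nextDoc()I (8 bytes) "
    "@ 0x00007f650c700669 [0x00007f650c700620+0x0000000000000049]"
)
frame = parse_compiled_frame(line)
print(frame["compiler"], frame["method"])
```

Collecting this field from several hs_err files shows whether each crash lands in the same method or in a different one each time.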

Provide logs (if relevant):

hs_err_pid13402.log
hs_err_pid1143.log

Running sudo systemctl status elasticsearch shows the following messages:

root@s-docker01:/var/log/elasticsearch# sudo systemctl status elasticsearch
● elasticsearch.service - Elasticsearch
     Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
     Active: failed (Result: signal) since Mon 2021-01-25 23:15:03 CST; 55min ago
       Docs: https://www.elastic.co
    Process: 1143 ExecStart=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet (code=killed, signal=ABRT)
   Main PID: 1143 (code=killed, signal=ABRT)
      Tasks: 0 (limit: 38033)
     Memory: 398.3M
     CGroup: /system.slice/elasticsearch.service

Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: # An error report file with more information is saved as:
Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: # /var/log/elasticsearch/hs_err_pid1143.log
Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: #
Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: # If you would like to submit a bug report, please visit:
Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: #   https://github.com/AdoptOpenJDK/openjdk-support/issues
Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: # The crash happened outside the Java Virtual Machine in native code.
Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: # See problematic frame for where to report the bug.
Jan 25 23:15:02 s-docker01 systemd-entrypoint[1143]: #
Jan 25 23:15:03 s-docker01 systemd[1]: elasticsearch.service: Main process exited, code=killed, status=6/ABRT
Jan 25 23:15:03 s-docker01 systemd[1]: elasticsearch.service: Failed with result 'signal'.

billhong-just · January 28, 2021, 2:10am

Yesterday my single-node Elasticsearch cluster crashed again, so I dumped the relevant logs.
But I cannot tell from the dmesg output which hardware caused it.
Could anyone help me pinpoint where the problem is?
I also opened an issue here.

Provide logs (if relevant):

hs_err_pid1144.log
dmesg_2021-01-27.log

Running sudo systemctl status elasticsearch shows the following messages:

● elasticsearch.service - Elasticsearch
     Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
     Active: failed (Result: signal) since Wed 2021-01-27 09:14:35 CST; 24min ago
       Docs: https://www.elastic.co
    Process: 1144 ExecStart=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet (code=killed, signal=ABRT)
   Main PID: 1144 (code=killed, signal=ABRT)
      Tasks: 0 (limit: 38033)
     Memory: 4.0G
     CGroup: /system.slice/elasticsearch.service

Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]:  scopes data    [0x00007ffa1cbd3e88,0x00007ffa1cbd3e98] = 16
Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]:  scopes pcs     [0x00007ffa1cbd3e98,0x00007ffa1cbd3ec8] = 48
Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]:  dependencies   [0x00007ffa1cbd3ec8,0x00007ffa1cbd3ed0] = 8
Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]:  handler table  [0x00007ffa1cbd3ed0,0x00007ffa1cbd3ee8] = 24
Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]: #
Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]: # If you would like to submit a bug report, please visit:
Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]: #   https://github.com/AdoptOpenJDK/openjdk-support/issues
Jan 27 09:14:34 s-docker01 systemd-entrypoint[1144]: #
Jan 27 09:14:35 s-docker01 systemd[1]: elasticsearch.service: Main process exited, code=killed, status=6/ABRT
Jan 27 09:14:35 s-docker01 systemd[1]: elasticsearch.service: Failed with result 'signal'.

warkolm · January 28, 2021, 2:21am

Welcome to our community!

What's in the Elasticsearch log, usually under /var/log/elasticsearch/?

billhong-just · January 28, 2021, 2:34am

Hi, Mark

Here are logs under /var/log/elasticsearch/.

warkolm · January 28, 2021, 2:50am

Can you please post them if they are not too big, or use gist/pastebin/etc.?

billhong-just · January 28, 2021, 3:08am

Hi, Mark
Here is the gist.
Please click the link to see all 18 logs.

https://gist.github.com/billhong-just/402962a3fd1f289646b8c350422afa58

elastic-observability-2021-01-24-1.log

[2021-01-24T00:10:00,006][INFO ][o.e.x.s.SnapshotLifecycleTask] [s-docker01] snapshot lifecycle policy [monitor-backup] issuing create snapshot [snapshot-2021.01.23-kwhbcbyitoa75nhizyxhaa]
[2021-01-24T00:10:00,032][INFO ][o.e.x.s.SnapshotLifecycleTask] [s-docker01] snapshot lifecycle policy job [monitor-backup-10] issued new snapshot creation for [snapshot-2021.01.23-kwhbcbyitoa75nhizyxhaa] successfully
[2021-01-24T00:10:00,078][INFO ][o.e.s.SnapshotsService   ] [s-docker01] snapshot [monitor:snapshot-2021.01.23-kwhbcbyitoa75nhizyxhaa/-NIoG_v1TUuzYk7U0oSaqA] started
[2021-01-24T00:10:53,070][INFO ][o.e.s.SnapshotsService   ] [s-docker01] snapshot [monitor:snapshot-2021.01.23-kwhbcbyitoa75nhizyxhaa/-NIoG_v1TUuzYk7U0oSaqA] completed with state [SUCCESS]
[2021-01-24T02:01:00,001][INFO ][o.e.x.m.MlDailyMaintenanceService] [s-docker01] triggering scheduled [ML] maintenance tasks
[2021-01-24T02:01:00,012][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [s-docker01] Deleting expired data
[2021-01-24T02:01:00,053][INFO ][o.e.x.m.j.r.UnusedStatsRemover] [s-docker01] Successfully deleted [0] unused stats documents
[2021-01-24T02:01:00,054][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [s-docker01] Completed deletion of expired ML data
[2021-01-24T02:01:00,054][INFO ][o.e.x.m.MlDailyMaintenanceService] [s-docker01] Successfully completed [ML] maintenance task: triggerDeleteExpiredDataTask
[2021-01-24T09:30:00,000][INFO ][o.e.x.s.SnapshotRetentionTask] [s-docker01] starting SLM retention snapshot cleanup task

This file has been truncated.

elastic-observability-2021-01-25-1.log

[2021-01-25T00:07:52,219][INFO ][o.e.n.Node               ] [s-docker01] version[7.10.2], pid[36395], build[default/deb/747e1cc71def077253878a59143c1f785afa92b9/2021-01-13T00:42:12.435326Z], OS[Linux/5.4.0-42-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/15.0.1/15.0.1+9]
[2021-01-25T00:07:52,231][INFO ][o.e.n.Node               ] [s-docker01] JVM home [/usr/share/elasticsearch/jdk], using bundled JDK [true]
[2021-01-25T00:07:52,232][INFO ][o.e.n.Node               ] [s-docker01] JVM arguments [-Xshare:auto, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -XX:+ShowCodeDetailsInExceptionMessages, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=SPI,COMPAT, -Xms15g, -Xmx15g, -XX:+UseG1GC, -XX:G1ReservePercent=25, -XX:InitiatingHeapOccupancyPercent=30, -Djava.io.tmpdir=/tmp/elasticsearch-17914499696916466326, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -XX:MaxDirectMemorySize=8053063680, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=default, -Des.distribution.type=deb, -Des.bundled_jdk=true]
[2021-01-25T00:07:54,135][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [aggs-matrix-stats]
[2021-01-25T00:07:54,135][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [analysis-common]
[2021-01-25T00:07:54,135][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [constant-keyword]
[2021-01-25T00:07:54,136][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [flattened]
[2021-01-25T00:07:54,136][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [frozen-indices]
[2021-01-25T00:07:54,136][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [ingest-common]
[2021-01-25T00:07:54,136][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [ingest-geoip]

This file has been truncated.

elastic-observability-2021-01-25-2.log

[2021-01-26T00:10:21,500][INFO ][o.e.n.Node               ] [s-docker01] version[7.10.2], pid[3259], build[default/deb/747e1cc71def077253878a59143c1f785afa92b9/2021-01-13T00:42:12.435326Z], OS[Linux/5.4.0-42-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/15.0.1/15.0.1+9]
[2021-01-26T00:10:21,513][INFO ][o.e.n.Node               ] [s-docker01] JVM home [/usr/share/elasticsearch/jdk], using bundled JDK [true]
[2021-01-26T00:10:21,513][INFO ][o.e.n.Node               ] [s-docker01] JVM arguments [-Xshare:auto, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -XX:+ShowCodeDetailsInExceptionMessages, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=SPI,COMPAT, -Xms16g, -Xmx16g, -XX:UseAVX=2, -XX:+UseG1GC, -XX:G1ReservePercent=25, -XX:InitiatingHeapOccupancyPercent=30, -Djava.io.tmpdir=/tmp/elasticsearch-5466461359207416476, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -XX:MaxDirectMemorySize=8589934592, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=default, -Des.distribution.type=deb, -Des.bundled_jdk=true]
[2021-01-26T00:10:23,553][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [aggs-matrix-stats]
[2021-01-26T00:10:23,553][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [analysis-common]
[2021-01-26T00:10:23,553][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [constant-keyword]
[2021-01-26T00:10:23,553][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [flattened]
[2021-01-26T00:10:23,554][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [frozen-indices]
[2021-01-26T00:10:23,554][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [ingest-common]
[2021-01-26T00:10:23,554][INFO ][o.e.p.PluginsService     ] [s-docker01] loaded module [ingest-geoip]

This file has been truncated.

There are more than three files in the gist.
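
The two startup lines previewed above (pid 36395 and pid 3259) do not list identical JVM arguments. A rough sketch of diffing them follows. The helper names `parse_jvm_args` and `diff_jvm_args` are made up for illustration, and the sample strings are abbreviated copies of the log lines, keeping only a few arguments in their original order.

```python
import re


def parse_jvm_args(log_line):
    """Return the arguments inside 'JVM arguments [...]' of a startup log line."""
    m = re.search(r"JVM arguments \[(.*)\]", log_line)
    if m is None:
        return []
    # Arguments are separated by ", "; commas inside one argument have no space.
    return m.group(1).split(", ")


def diff_jvm_args(old_line, new_line):
    """Return (removed, added) arguments between two startup log lines."""
    old, new = parse_jvm_args(old_line), parse_jvm_args(new_line)
    removed = [a for a in old if a not in new]
    added = [a for a in new if a not in old]
    return removed, added


# Abbreviated copies of the two startup lines shown above.
run_pid36395 = (
    "[s-docker01] JVM arguments [-Xshare:auto, -Djava.locale.providers=SPI,COMPAT, "
    "-Xms15g, -Xmx15g, -XX:+UseG1GC, -XX:MaxDirectMemorySize=8053063680]"
)
run_pid3259 = (
    "[s-docker01] JVM arguments [-Xshare:auto, -Djava.locale.providers=SPI,COMPAT, "
    "-Xms16g, -Xmx16g, -XX:UseAVX=2, -XX:+UseG1GC, -XX:MaxDirectMemorySize=8589934592]"
)

removed, added = diff_jvm_args(run_pid36395, run_pid3259)
print("removed:", removed)
print("added:  ", added)
```

On the full log lines the same comparison would also report the `-Djava.io.tmpdir` value, which differs between the two runs.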

DavidTurner · January 28, 2021, 8:21am

[Mon Jan 25 22:05:40 2021] node[1175]: segfault at 1 ip 0000000000000001 sp 00007ffee4ae5248 error 14 in node[400000+204e000]
[Mon Jan 25 22:05:40 2021] Code: Bad RIP value.
[Tue Jan 26 20:48:05 2021] traps: node[5038] trap invalid opcode ip:17aad89 sp:7fc92b992898 error:0 in node[400000+204e000]

This looks like bad hardware, although more likely bad RAM or CPU rather than storage. Does this reproduce on a different machine?
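
A small sketch of how such kernel messages can be pulled out of a saved dmesg dump, assuming the two line shapes quoted above (`segfault at` and `traps: ... trap ...`). The function name `find_faults` and both regular expressions are illustrative and cover only these two message formats. Note that both quoted faults are in a process named `node`, not in the Elasticsearch JVM.

```python
import re

# Kernel messages of the two kinds quoted above.
SEGFAULT_RE = re.compile(r"\]\s*(?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]: segfault at ")
TRAP_RE = re.compile(
    r"\]\s*traps: (?P<proc>[^\[\s]+)\[(?P<pid>\d+)\] trap (?P<kind>.+?) ip:"
)


def find_faults(dmesg_text):
    """Return (process, pid, kind) for each segfault or trap line found."""
    faults = []
    for line in dmesg_text.splitlines():
        m = SEGFAULT_RE.search(line)
        if m:
            faults.append((m["proc"], int(m["pid"]), "segfault"))
            continue
        m = TRAP_RE.search(line)
        if m:
            faults.append((m["proc"], int(m["pid"]), m["kind"]))
    return faults


sample = """\
[Mon Jan 25 22:05:40 2021] node[1175]: segfault at 1 ip 0000000000000001 sp 00007ffee4ae5248 error 14 in node[400000+204e000]
[Mon Jan 25 22:05:40 2021] Code: Bad RIP value.
[Tue Jan 26 20:48:05 2021] traps: node[5038] trap invalid opcode ip:17aad89 sp:7fc92b992898 error:0 in node[400000+204e000]
"""

faults = find_faults(sample)
for proc, pid, kind in faults:
    print(proc, pid, kind)
```

Running the same scan on the replacement machine's dmesg output is one way to check that faults like these no longer appear.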

billhong-just · February 2, 2021, 3:04am

I have moved my single-node Elasticsearch cluster to a different machine, and it has run healthily for about 4 days.
There are no hardware error messages in the dmesg output.
I will keep watching and report back here.

Thanks for your help.

billhong-just · February 22, 2021, 1:19am

Hi community,
It has been about 3 weeks since my last report.
My Elasticsearch cluster has stayed healthy since then.
We can confirm the problem was caused by bad hardware.
Thank you all.
