Elasticsearch version
bin/elasticsearch --version
7.1.1
Plugins installed
bin/elasticsearch-plugin list
analysis-ik
JVM version
java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
OS version
uname -a
Linux log-es05.com 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Related config
node.master: false
node.data: true
node.ingest: true
node.ml: false
xpack.ml.enabled: true
Issue
My Elasticsearch cluster had been running for more than two weeks when, yesterday, one of the nodes suddenly crashed.
Here is the information I can provide.
1. Time of the crash
The node crashed at about 2019-07-21 23:40:00.
2. The hs_err_pid%p.log file
This file contains a lot of information; the parts I think are important are below.
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f197d41866e, pid=22536, tid=139739997894400
#
# JRE version: Java(TM) SE Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 5754 sun.nio.ch.EPollArrayWrapper.epollWait(JIJI)I (0 bytes) @ 0x00007f197d41866e [0x00007f197d418580+0xee]
#
# Core dump written. Default location: /home/deploy/search/elasticsearch-7.1.1/core or core.22536
--------------- T H R E A D ---------------
Current thread (0x00007f183000a800): JavaThread "elasticsearch[ESV14][transport_worker][T#13]" daemon [_thread_in_native_trans, id=22779, stack(0x00007f17c0df7000,0x00007f17c0ef8000)]
siginfo: si_signo: 11 (SIGSEGV), si_code: 0 (SI_USER)
Registers:
RAX=0x0000000000000001, RBX=0x00007f17c0ef6540, RCX=0x0000000000000a80,
...
Top of Stack: (sp=0x00007f17c0ef6470)
0x00007f17c0ef6470: 00000000c58c9a90 00007f197f02fc30
...
0x00007f17c0ef6660:
Instructions: (pc=0x00007f197d41866e)
...
Register to memory mapping:
RAX=0x0000000000000001 is an unknown value
...
R13=0x0000000000000007 is an unknown value
R14=0x00000000c816bf50 is an oop
java.lang.Object
- klass: 'java/lang/Object'
R15=0x00007f183000a800 is a thread
Stack: [0x00007f17c0df7000,0x00007f17c0ef8000], sp=0x00007f17c0ef6470, free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 5754 sun.nio.ch.EPollArrayWrapper.epollWait(JIJI)I (0 bytes) @ 0x00007f197d41866e [0x00007f197d418580+0xee]
J 10602 C2 sun.nio.ch.EPollArrayWrapper.poll(J)I (70 bytes) @ 0x00007f197e4ff6ac [0x00007f197e4ff5c0+0xec]
J 29708 C2 sun.nio.ch.EPollSelectorImpl.doSelect(J)I (124 bytes) @ 0x00007f1981dfc6dc [0x00007f1981dfc480+0x25c]
J 9858 C2 sun.nio.ch.SelectorImpl.select(J)I (34 bytes) @ 0x00007f197eac8984 [0x00007f197eac8800+0x184]
J 39902 C2 io.netty.channel.nio.NioEventLoop.select(Z)V (307 bytes) @ 0x00007f197e341248 [0x00007f197e3410e0+0x168]
J 24033% C2 io.netty.channel.nio.NioEventLoop.run()V (236 bytes) @ 0x00007f197fc13014 [0x00007f197fc12e80+0x194]
j io.netty.util.concurrent.SingleThreadEventExecutor$5.run()V+44
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub
V [libjvm.so+0x68dbc6] JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
V [libjvm.so+0x68e0d1] JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
V [libjvm.so+0x68e567] JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
V [libjvm.so+0x7254b0] thread_entry(JavaThread*, Thread*)+0xa0
V [libjvm.so+0xa6b77f] JavaThread::thread_main_inner()+0xdf
V [libjvm.so+0xa6b8ac] JavaThread::run()+0x11c
V [libjvm.so+0x91ef78] java_start(Thread*)+0x108
C [libpthread.so.0+0x7aa1] start_thread+0xd1
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 5754 sun.nio.ch.EPollArrayWrapper.epollWait(JIJI)I (0 bytes) @ 0x00007f197d4185c8 [0x00007f197d418580+0x48]
J 10602 C2 sun.nio.ch.EPollArrayWrapper.poll(J)I (70 bytes) @ 0x00007f197e4ff6ac [0x00007f197e4ff5c0+0xec]
J 29708 C2 sun.nio.ch.EPollSelectorImpl.doSelect(J)I (124 bytes) @ 0x00007f1981dfc6dc [0x00007f1981dfc480+0x25c]
J 9858 C2 sun.nio.ch.SelectorImpl.select(J)I (34 bytes) @ 0x00007f197eac8984 [0x00007f197eac8800+0x184]
J 39902 C2 io.netty.channel.nio.NioEventLoop.select(Z)V (307 bytes) @ 0x00007f197e341248 [0x00007f197e3410e0+0x168]
J 24033% C2 io.netty.channel.nio.NioEventLoop.run()V (236 bytes) @ 0x00007f197fc13014 [0x00007f197fc12e80+0x194]
j io.netty.util.concurrent.SingleThreadEventExecutor$5.run()V+44
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub
....
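To make the excerpt above easier to compare with other crash reports, here is a minimal Python sketch that pulls the signal, pid, tid, si_code and problematic frame out of an hs_err log. The function name `summarize_hs_err` and the regular expressions are my own assumptions based only on the lines shown above; they are not part of any JDK or Elasticsearch tool, and other JVM versions may format these lines differently.

```python
import re

# Sample lines copied from the hs_err excerpt above.
SAMPLE = """\
# SIGSEGV (0xb) at pc=0x00007f197d41866e, pid=22536, tid=139739997894400
# Problematic frame:
# J 5754 sun.nio.ch.EPollArrayWrapper.epollWait(JIJI)I (0 bytes) @ 0x00007f197d41866e [0x00007f197d418580+0xee]
siginfo: si_signo: 11 (SIGSEGV), si_code: 0 (SI_USER)
"""


def summarize_hs_err(text):
    """Return key fields of an hs_err log as a dict; fields not found are omitted."""
    summary = {}

    # Header line: signal name, faulting pc, process id and thread id.
    m = re.search(
        r"#\s+(SIG\w+) \(0x[0-9a-f]+\) at pc=(0x[0-9a-f]+), pid=(\d+), tid=(\d+)",
        text,
    )
    if m:
        summary["signal"] = m.group(1)
        summary["pc"] = m.group(2)
        summary["pid"] = int(m.group(3))
        summary["tid"] = int(m.group(4))

    # siginfo line: numeric si_code and its symbolic name.
    m = re.search(r"si_code: (\d+) \((\w+)\)", text)
    if m:
        summary["si_code"] = int(m.group(1))
        summary["si_code_name"] = m.group(2)

    # The line directly after "# Problematic frame:" names the frame.
    m = re.search(r"# Problematic frame:\n#\s+(.+)", text)
    if m:
        summary["frame"] = m.group(1).strip()

    return summary


if __name__ == "__main__":
    for key, value in summarize_hs_err(SAMPLE).items():
        print(f"{key}: {value}")
```

I included `si_code` because on Linux `SI_USER` is the code reported for a signal sent by another process (for example with kill), so it may be worth checking. That reading is only a guess on my part, not a diagnosis.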
3. Self-monitoring
We also have some basic monitoring for the machine; the pictures below show it.
3.1 The idle time of one CPU core was 0 during that period.
3.2 The same core showed high iowait.
4. Core file
No JVM heap dump file was created (which may mean that no out-of-memory error occurred).
A core file was found, but it appears to be truncated. I tried to read it with gdb
and got the output below.
There were 273 "[New Thread ...]" lines in total; most are omitted here.
$ gdb /opt/soft/jdk1.8.0_91/bin/java core.22536
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Reading symbols from /opt/soft/jdk1.8.0_91/bin/java...Missing separate debuginfo for /opt/soft/jdk1.8.0_91/bin/java
Try: yum --enablerepo='*-debug*' install /usr/lib/debug/.build-id/bd/74b7294ebbdd93e9ef3b729e5aab228a3f681b.debug
(no debugging symbols found)...done.
BFD: Warning: /data/temp/core.22536 is truncated: expected core file size >= 36030541824, found: 28792045568.
[New Thread 22779]
[New Thread 22780]
......
[New Thread 22544]
[New Thread 22798]
[New Thread 22564]
Cannot access memory at address 0x7f1993668168
Cannot access memory at address 0x7f1993668168
Cannot access memory at address 0x7f1993668168
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Failed to read a valid object file image from memory.
Core was generated by `/usr/local/jdk1.8.0_91/bin/java -Xms30g -Xmx30g -XX:+UseG1GC -XX:MaxGCPauseMill'.
Program terminated with signal 6, Aborted.
#0 0x00007f1992aae5e5 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.209.el6_9.2.x86_64
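To show how much of the core file is missing, here is a small Python sketch that parses the BFD warning from the gdb output above and works out the difference. The helper name `truncation_report` is hypothetical, and the regular expression assumes the warning has exactly the wording shown above.

```python
import re

# The warning line copied from the gdb output above.
WARNING = (
    "BFD: Warning: /data/temp/core.22536 is truncated: "
    "expected core file size >= 36030541824, found: 28792045568."
)

GIB = 2 ** 30  # bytes per GiB


def truncation_report(line):
    """Parse a BFD truncation warning; return (expected, found, missing) in bytes."""
    m = re.search(r"expected core file size >= (\d+), found: (\d+)", line)
    if m is None:
        raise ValueError("not a BFD truncation warning")
    expected, found = int(m.group(1)), int(m.group(2))
    return expected, found, expected - found


if __name__ == "__main__":
    expected, found, missing = truncation_report(WARNING)
    print(
        f"expected {expected / GIB:.2f} GiB, found {found / GIB:.2f} GiB, "
        f"missing {missing / GIB:.2f} GiB"
    )
```

With the numbers above, about 6.74 GiB of the expected 33.56 GiB is missing, which probably explains the "Cannot access memory" errors from gdb.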
Sorry to bother you, but I really have no idea how to deal with this.
After the crash, I restarted Elasticsearch, and it has been working well since.