Hi,
My setup:
- cluster of 3 nodes on dedicated Ubuntu 16.04.3 LTS VMs on Azure
- ES version: 5.6.2
- JRE: 1.8.0_144-b01 (Oracle)
- Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 )
My problem:
- One of the nodes started to crash once in a while (every 1-2 days) since it was upgraded from 5.5.x to 5.6.2. Other nodes were upgraded as well but not affected.
- Crash traces can be visible in syslog files and reveal SIGABRT-triggered terminations:
syslog.4:Sep 30 01:10:24 elastic-node-vm1 systemd[1]: elasticsearch.service: Main process exited, code=killed, status=6/ABRT
syslog.4:Sep 30 01:10:24 elastic-node-vm1 systemd[1]: elasticsearch.service: Unit entered failed state.
syslog.4:Sep 30 01:10:24 elastic-node-vm1 systemd[1]: elasticsearch.service: Failed with result 'signal'.[...]
syslog.1:Oct 2 20:18:20 elastic-node-vm1 systemd[1]: elasticsearch.service: Main process exited, code=killed, status=6/ABRT
syslog.1:Oct 2 20:18:20 elastic-node-vm1 systemd[1]: elasticsearch.service: Unit entered failed state.
syslog.1:Oct 2 20:18:20 elastic-node-vm1 systemd[1]: elasticsearch.service: Failed with result 'signal'.
- Checking /tmp/hs_err_*.log files reveals JVM errors with SIGSEGV happening about 2 hours before Elasticsearch service crashes:
In hs_err_pid43032.log (created on Sep 29 23:34):
SIGSEGV (0xb) at pc=0x00007feb4f1f86d3, pid=43032, tid=0x00007fe1fcbe1700
JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 )
Problematic frame:
V [libjvm.so+0x9176d3] OopMapSet::find_map_at_offset(int) const+0x43
Current thread (0x00007fe244023000): JavaThread "elasticsearch[elastic-node-vm1][http_server_worker][T#9]" daemon [_thread_in_Java, id=43305, stack(0x00007fe1fcae1000,0x00007fe1fcbe2000)]
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000
[...]
Stack: [0x00007fe1fcae1000,0x00007fe1fcbe2000], sp=0x00007fe1fcbdfb40, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x9176d3] OopMapSet::find_map_at_offset(int) const+0x43
V [libjvm.so+0x918d38] OopMapSet::update_register_map(frame const*, RegisterMap*)+0x78
V [libjvm.so+0x593dd5] frame::sender_for_compiled_frame(RegisterMap*) const+0xf5
V [libjvm.so+0x594057] frame::sender(RegisterMap*) const+0x67
V [libjvm.so+0x9bd91e] OptoRuntime::is_deoptimized_caller_frame(JavaThread*)+0x9e
V [libjvm.so+0x9be6d6] OptoRuntime::new_array_nozero_C(Klass*, int, JavaThread*)+0x116
v ~RuntimeStub::_new_array_nozero_Java
J 21013 C2 org.elasticsearch.common.Strings.splitStringToSet(Ljava/lang/String;C)Ljava/util/Set; (215 bytes) @ 0x00007feb3a1d0d20 [0x00007feb3a1d0420+0x900]
J 24756 C2 org.elasticsearch.rest.action.admin.cluster.RestNodesInfoAction.prepareRequest(Lorg/elasticsearch/rest/RestRequest;Lorg/elasticsearch/client/node/NodeClient;)Lorg/elasticsearch/rest/BaseRestHandler$RestChannelConsumer; (349 bytes) @ 0x00007feb3cab7e30 [0x00007feb3cab76a0+0x790]
J 23668 C2 org.elasticsearch.rest.BaseRestHandler.handleRequest(Lorg/elasticsearch/rest/RestRequest;Lorg/elasticsearch/rest/RestChannel;Lorg/elasticsearch/client/node/NodeClient;)V (117 bytes) @ 0x00007feb3c71d05c [0x00007feb3c71d020+0x3c]
J 23648 C2 io.netty.channel.SimpleChannelInboundHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (74 bytes) @ 0x00007feb3c725890 [0x00007feb3c723080+0x2810]
J 23821 C2 org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (53 bytes) @ 0x00007feb3a41e064 [0x00007feb3a41de20+0x244]
J 19094 C2 io.netty.handler.codec.MessageToMessageDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (156 bytes) @ 0x00007feb3b959f60 [0x00007feb3b959bc0+0x3a0]
J 23645 C2 io.netty.handler.codec.MessageToMessageCodec.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (10 bytes) @ 0x00007feb3bc913cc [0x00007feb3bc913a0+0x2c]
J 19094 C2 io.netty.handler.codec.MessageToMessageDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (156 bytes) @ 0x00007feb3b959f60 [0x00007feb3b959bc0+0x3a0]
J 19094 C2 io.netty.handler.codec.MessageToMessageDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (156 bytes) @ 0x00007feb3b959f60 [0x00007feb3b959bc0+0x3a0]
J 12763 C2 io.netty.handler.codec.ByteToMessageDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (317 bytes) @ 0x00007feb3ac504d8 [0x00007feb3ac50020+0x4b8]
J 10117 C2 io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (9 bytes) @ 0x00007feb3932b740 [0x00007feb3932b5c0+0x180]
J 9598 C2 io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (9 bytes) @ 0x00007feb3a5c0b40 [0x00007feb3a5c09c0+0x180]
J 9133 C2 io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(Ljava/util/Set;)V (127 bytes) @ 0x00007feb3a7902f8 [0x00007feb3a78f660+0xc98]
J 17444% C2 io.netty.channel.nio.NioEventLoop.run()V (221 bytes) @ 0x00007feb3b6a48dc [0x00007feb3b6a46a0+0x23c]
j io.netty.util.concurrent.SingleThreadEventExecutor$5.run()V+44
j java.lang.Thread.run()V+11
[...]
- Another example In hs_err_pid1804.log (created on Oct 2 18:19):
Problematic frame:
[libjvm.so+0x9176d3] OopMapSet::find_map_at_offset(int) const+0x43Stack: [0x00007f3e5d899000,0x00007f3e5d99a000], sp=0x00007f3e5d9980d0, free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x9176d3] OopMapSet::find_map_at_offset(int) const+0x43
V [libjvm.so+0x918d38] OopMapSet::update_register_map(frame const*, RegisterMap*)+0x78
V [libjvm.so+0x593dd5] frame::sender_for_compiled_frame(RegisterMap*) const+0xf5
V [libjvm.so+0x594057] frame::sender(RegisterMap*) const+0x67
V [libjvm.so+0xa708ce] JavaThread::nmethods_do(CodeBlobClosure*)+0x9e
V [libjvm.so+0xa7097b] Threads::nmethods_do(CodeBlobClosure*)+0x2b
V [libjvm.so+0xa30582] NMethodSweeper::mark_active_nmethods()+0x32
V [libjvm.so+0x9c0c75] SafepointSynchronize::do_cleanup_tasks()+0x95
V [libjvm.so+0x9c1926] SafepointSynchronize::begin()+0x406
V [libjvm.so+0xad2dcf] VMThread::loop()+0x1bf
V [libjvm.so+0xad3250] VMThread::run()+0x70
V [libjvm.so+0x92a338] java_start(Thread*)+0x108
So maybe the culprit is a bug in JVM / something related to OopMapSet::find_map_at_offset
?
What would you recommend to do?
Thanks,
MG