Data node removed; master_failed

Hi,
I have a cluster with 3x client, 3x master & 3x data nodes.
It has been running for weeks without any issues, but today I suddenly ran into some errors.

ES Version: 5.1.1

Symptoms

  • My cluster and all indices turned yellow
  • All nodes are green
  • 70 unassigned shards out of 224 in total (see the check sketched below)
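
For reference, this is a minimal sketch of how the unassigned-shard count and per-shard reasons can be pulled over the REST API. It is not part of the original incident data; the host/port and the use of Python with the `requests` package are my assumptions.

# Minimal sketch (assumptions: Python 3, `requests` installed, a node
# answering on localhost:9200): pull cluster health and the reason each
# unassigned shard is unassigned.
import requests

ES = "http://localhost:9200"

health = requests.get(ES + "/_cluster/health").json()
print("status:", health["status"], "| unassigned shards:", health["unassigned_shards"])

# _cat/shards exposes the unassigned reason per shard (e.g. NODE_LEFT).
shards = requests.get(
    ES + "/_cat/shards",
    params={"h": "index,shard,prirep,state,unassigned.reason", "format": "json"},
).json()
for s in shards:
    if s["state"] == "UNASSIGNED":
        print(s["index"], s["shard"], s["prirep"], s["unassigned.reason"])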

Hypothesis
Data node 03 was unable to talk to the master, so the master decided to reallocate its shards to the available data nodes.
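
If that hypothesis is right, the allocation explain API should report those shards as unassigned with reason NODE_LEFT. A minimal sketch follows; the host/port are assumptions, and it only returns something while shards are still unassigned.

# Sketch: ask Elasticsearch why an unassigned shard is unassigned. With no
# request body, _cluster/allocation/explain (available since 5.0) picks an
# arbitrary unassigned shard; it errors out if every shard is assigned.
import requests

ES = "http://localhost:9200"

explain = requests.get(ES + "/_cluster/allocation/explain").json()
info = explain.get("unassigned_info", {})
print(explain.get("index"), explain.get("shard"),
      "reason:", info.get("reason"), "|", info.get("details"))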

Incident Time: 11:34am
Recovery Time: 11:38am

Jun 2 11:34:03 es-master-02 es-production-cluster.log WARN [o.e.c.a.s.ShardStateAction] [es-master-02.cluster123.internal] [index20170525][2] received shard failed for shard id [[index20170525][2]], allocation id [xLjSTaN3T5evzL5091IaQg], primary term [1], message [mark copy as stale]
Jun 2 11:34:03 es-data-03 es-production-cluster.log  nt nodes: nodes: 
   {es-client-03.cluster123.internal}{xYH-LxRhTfa2-ivGF8xYtg}{3n6ZS3rYQzet6CylzWuI_A}{10.0.10.124}{10.0.10.124:9300}
   {es-master-03.cluster123.internal}{wUy1UMaURc-s4So_hngIBw}{MjIvlZXQSfuNtmmkY7pVHw}{10.0.10.41}{10.0.10.41:9300}
   {es-data-02.cluster123.internal}{LD551HnKSPOVLD-rda9X0A}{oPMl6McNTp6zKefK6mdMOA}{10.0.10.58}{10.0.10.58:9300}
   {es-master-01.cluster123.internal}{yt5mqQadRtyvx3w2Kc4KSA}{aB9ge6UfQnuIo_TmyZaRtg}{10.0.10.96}{10.0.10.96:9300}
   {es-data-03.cluster123.internal}{tlaG3OlNTKquzNT0IJkLWA}{xKi2IznpToqC5j436FYUuw}{10.0.10.156}{10.0.10.156:9300}, local
   {es-client-02.cluster123.internal}{D34sEfjyQzy_4iHere-hTw}{YPdWK9urR-6u1FsWmACTfQ}{10.0.10.108}{10.0.10.108:9300}
   {es-client-01.cluster123.internal}{dWiuPoPnQgakiNTT1YDC0A}{LBUZELNITf-YFOmFGVACLA}{10.0.10.14}{10.0.10.14:9300}
   {es-data-01.cluster123.internal}{0rPrYbWZTjauAulRYuG7IQ}{iF8HhNCUQcuanAZ1gWRFhA}{10.0.10.166}{10.0.10.166:9300}
Jun 2 11:34:03 es-data-03 es-production-cluster.log
Jun 2 11:34:03 es-data-03 es-production-cluster.log INFO [o.e.c.s.ClusterService] [es-data-03.cluster123.internal] removed {{es-master-02.cluster123.internal}{SSyCmD3mRvWcADDbf0co5w}{H8SLzxZXTlqqj1II8rK3Xg}{10.0.10.69}{10.0.10.69:9300},}, reason: master_failed ({es-master-02.cluster123.internal}{SSyCmD3mRvWcADDbf0co5w}{H8SLzxZXTlqqj1II8rK3Xg}{10.0.10.69}{10.0.10.69:9300})
Jun 2 11:34:04 es-master-02 es-production-cluster.log WARN [o.e.c.a.s.ShardStateAction] [es-master-02.cluster123.internal] [index20170525][4] received shard failed for shard id [[index20170525][4]], allocation id [qd6gLmuqThe8okSQyi11vw], primary term [2], message [mark copy as stale]

That is the whole cluster log. I actually don't see any reason for data node 03 to be down: CPU, memory & network all look fine. What baffles me even more is how it recovered automatically after 4 minutes without any message in the log.
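
From what I understand, the removal above is zen fault detection on es-data-03 deciding the master had failed; once its pings to the master succeed again the node rejoins on its own, which might explain a recovery a few minutes later. For reference, here is a sketch (host/port assumed) of checking whether the fault-detection timeouts were changed from the defaults on any node:

# Sketch: list each node's explicitly configured zen fault-detection settings.
# Nothing printed means the defaults are in effect:
# discovery.zen.fd.ping_interval 1s, ping_timeout 30s, ping_retries 3, i.e.
# roughly 90 seconds of failed pings before a node declares master_failed.
import requests

ES = "http://localhost:9200"  # assumption: a node reachable locally

nodes = requests.get(ES + "/_nodes/settings", params={"flat_settings": "true"}).json()
for node_id, node in nodes["nodes"].items():
    for key, value in node.get("settings", {}).items():
        if key.startswith("discovery.zen.fd"):
            print(node.get("name"), key, "=", value)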

It would be great if someone could provide more insight into this.

Thanks in advance.

What version?

Ah sorry, I missed that one out. It's version 5.1.1.
I've added it to my original post.

Is there anything in the logs of the master node?

There's one entry in the master log:

Jun 2 11:34:02 es-master-02 es-production-cluster.log WARN [o.e.t.n.Netty4Transport] [es-master-02.cluster123.internal] exception caught on transport layer [[id: 0x0eb63c31, L:/10.0.10.69:51014 - R:10.0.10.156/10.0.10.156:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [38627469], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@1b234b31], error [false]; resetting
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1257) ~[elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[transport-netty4-5.1.1.jar:5.1.1]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) [netty-common-4.1.6.Final.jar:4.1.6.Final]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]

Is there anything you can deduce from the above error log?

Not really, sorry.
Is there nothing else in the logs? Has it happened again?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.