Data node removed; master_failed


(Nor) #1

Hi,
I have a cluster with 3x client, 3x master & 3x data nodes.
It has been running for weeks without any issues, but today I'm suddenly facing some errors.

ES Version: 5.1.1

Symptom

  • My cluster and all indices turned yellow
  • All nodes are green
  • 70 unassigned shards out of total 224
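For anyone seeing the same symptoms, they can be confirmed with something like the following (a sketch, assuming the cluster's HTTP API is reachable on localhost:9200):

```shell
# Overall cluster status and the unassigned-shard count
curl -s 'localhost:9200/_cluster/health?pretty'

# Per-shard view, including why each unassigned shard is unassigned
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' \
  | grep -i UNASSIGNED
```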

Hypothesis
Data node 03 was unable to talk to the master, so the master decided to reallocate its shards to the available data nodes.
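One way to test a hypothesis like this is the allocation-explain API (available since 5.0): called without a body, it picks an arbitrary unassigned shard and reports why it isn't allocated (a sketch, assuming the API is on localhost:9200):

```shell
# Explains why an (arbitrary) unassigned shard is not allocated,
# e.g. a node left, or a copy was marked stale after the master change.
curl -s -XGET 'localhost:9200/_cluster/allocation/explain?pretty'
```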

Incident Time: 11:34am
Recovery Time: 11:38am

Jun 2 11:34:03 es-master-02 es-production-cluster.log WARN [o.e.c.a.s.ShardStateAction] [es-master-02.cluster123.internal] [index20170525][2] received shard failed for shard id [[index20170525][2]], allocation id [xLjSTaN3T5evzL5091IaQg], primary term [1], message [mark copy as stale]
Jun 2 11:34:03 es-data-03 es-production-cluster.log  nt nodes: nodes: 
   {es-client-03.cluster123.internal}{xYH-LxRhTfa2-ivGF8xYtg}{3n6ZS3rYQzet6CylzWuI_A}{10.0.10.124}{10.0.10.124:9300}
   {es-master-03.cluster123.internal}{wUy1UMaURc-s4So_hngIBw}{MjIvlZXQSfuNtmmkY7pVHw}{10.0.10.41}{10.0.10.41:9300}
   {es-data-02.cluster123.internal}{LD551HnKSPOVLD-rda9X0A}{oPMl6McNTp6zKefK6mdMOA}{10.0.10.58}{10.0.10.58:9300}
   {es-master-01.cluster123.internal}{yt5mqQadRtyvx3w2Kc4KSA}{aB9ge6UfQnuIo_TmyZaRtg}{10.0.10.96}{10.0.10.96:9300}
   {es-data-03.cluster123.internal}{tlaG3OlNTKquzNT0IJkLWA}{xKi2IznpToqC5j436FYUuw}{10.0.10.156}{10.0.10.156:9300}, local
   {es-client-02.cluster123.internal}{D34sEfjyQzy_4iHere-hTw}{YPdWK9urR-6u1FsWmACTfQ}{10.0.10.108}{10.0.10.108:9300}
   {es-client-01.cluster123.internal}{dWiuPoPnQgakiNTT1YDC0A}{LBUZELNITf-YFOmFGVACLA}{10.0.10.14}{10.0.10.14:9300}
   {es-data-01.cluster123.internal}{0rPrYbWZTjauAulRYuG7IQ}{iF8HhNCUQcuanAZ1gWRFhA}{10.0.10.166}{10.0.10.166:9300}
Jun 2 11:34:03 es-data-03 es-production-cluster.log
Jun 2 11:34:03 es-data-03 es-production-cluster.log INFO [o.e.c.s.ClusterService] [es-data-03.cluster123.internal] removed {{es-master-02.cluster123.internal}{SSyCmD3mRvWcADDbf0co5w}{H8SLzxZXTlqqj1II8rK3Xg}{10.0.10.69}{10.0.10.69:9300},}, reason: master_failed ({es-master-02.cluster123.internal}{SSyCmD3mRvWcADDbf0co5w}{H8SLzxZXTlqqj1II8rK3Xg}{10.0.10.69}{10.0.10.69:9300})
Jun 2 11:34:04 es-master-02 es-production-cluster.log WARN [o.e.c.a.s.ShardStateAction] [es-master-02.cluster123.internal] [index20170525][4] received shard failed for shard id [[index20170525][4]], allocation id [qd6gLmuqThe8okSQyi11vw], primary term [2], message [mark copy as stale]

That is the whole cluster log. I don't see any actual reason for data node 03 to be down: CPU, memory & network all look fine. What baffles me even more is how it recovered automatically after 4 minutes without any message in the log.
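For reference, these are (as far as I know) the 5.x zen fault-detection defaults that govern how quickly a node gives up on the master and logs master_failed; I haven't changed them, so a ~30s network or GC hiccup could in theory trigger this:

```yaml
# elasticsearch.yml — zen discovery fault-detection settings (5.x defaults, listed for reference)
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3
```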

It would be great if someone could provide more insight into this.

Thanks in advance.


(Mark Walkom) #2

What version?


(Nor) #3

Ah sorry, I missed that one out. It's version 5.1.1.
I've added it to my original post.


(Mark Walkom) #4

Is there anything in the logs of the master node?


(Nor) #5

There's one entry in the master log. Note that the remote address in the exception, 10.0.10.156, is es-data-03:

Jun 2 11:34:02 es-master-02 es-production-cluster.log WARN [o.e.t.n.Netty4Transport] [es-master-02.cluster123.internal] exception caught on transport layer [[id: 0x0eb63c31, L:/10.0.10.69:51014 - R:10.0.10.156/10.0.10.156:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [38627469], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@1b234b31], error [false]; resetting
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1257) ~[elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[transport-netty4-5.1.1.jar:5.1.1]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) [netty-common-4.1.6.Final.jar:4.1.6.Final]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]

(Nor) #6

Is there anything you can deduce from the above error log?


(Mark Walkom) #7

Not really sorry.
There's nothing else in the logs? Has it happened again?


(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.