Data node removed; master_failed


(Nor) #1

Hi,
I have a cluster with 3x client, 3x master & 3x data nodes.
It has been running for weeks without any issues, but today I'm suddenly facing some errors.

ES Version: 5.1.1

Symptom

  • My cluster and all indices turned yellow
  • All nodes are green
  • 70 unassigned shards out of total 224
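For anyone seeing the same symptoms, they can be confirmed with something like the following (a sketch, assuming the cluster's HTTP API is reachable on localhost:9200):

```shell
# Overall cluster status and the unassigned-shard count
curl -s 'localhost:9200/_cluster/health?pretty'

# Per-shard view, including why each unassigned shard is unassigned
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' \
  | grep -i UNASSIGNED
```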

Hypothesis
Data node 03 was unable to talk to the master, so the master decided to reallocate its shards to the available data nodes.
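One way to test a hypothesis like this is the allocation-explain API (available since 5.0): called without a body, it picks an arbitrary unassigned shard and reports why it isn't allocated (a sketch, assuming the API is on localhost:9200):

```shell
# Explains why an (arbitrary) unassigned shard is not allocated,
# e.g. a node left, or a copy was marked stale after the master change.
curl -s -XGET 'localhost:9200/_cluster/allocation/explain?pretty'
```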

Incident Time: 11:34am
Recovery Time: 11:38am

Jun 2 11:34:03 es-master-02 es-production-cluster.log WARN [o.e.c.a.s.ShardStateAction] [es-master-02.cluster123.internal] [index20170525][2] received shard failed for shard id [[index20170525][2]], allocation id [xLjSTaN3T5evzL5091IaQg], primary term [1], message [mark copy as stale]
Jun 2 11:34:03 es-data-03 es-production-cluster.log  nt nodes: nodes: 
   {es-client-03.cluster123.internal}{xYH-LxRhTfa2-ivGF8xYtg}{3n6ZS3rYQzet6CylzWuI_A}{10.0.10.124}{10.0.10.124:9300}
   {es-master-03.cluster123.internal}{wUy1UMaURc-s4So_hngIBw}{MjIvlZXQSfuNtmmkY7pVHw}{10.0.10.41}{10.0.10.41:9300}
   {es-data-02.cluster123.internal}{LD551HnKSPOVLD-rda9X0A}{oPMl6McNTp6zKefK6mdMOA}{10.0.10.58}{10.0.10.58:9300}
   {es-master-01.cluster123.internal}{yt5mqQadRtyvx3w2Kc4KSA}{aB9ge6UfQnuIo_TmyZaRtg}{10.0.10.96}{10.0.10.96:9300}
   {es-data-03.cluster123.internal}{tlaG3OlNTKquzNT0IJkLWA}{xKi2IznpToqC5j436FYUuw}{10.0.10.156}{10.0.10.156:9300}, local
   {es-client-02.cluster123.internal}{D34sEfjyQzy_4iHere-hTw}{YPdWK9urR-6u1FsWmACTfQ}{10.0.10.108}{10.0.10.108:9300}
   {es-client-01.cluster123.internal}{dWiuPoPnQgakiNTT1YDC0A}{LBUZELNITf-YFOmFGVACLA}{10.0.10.14}{10.0.10.14:9300}
   {es-data-01.cluster123.internal}{0rPrYbWZTjauAulRYuG7IQ}{iF8HhNCUQcuanAZ1gWRFhA}{10.0.10.166}{10.0.10.166:9300}
Jun 2 11:34:03 es-data-03 es-production-cluster.log
Jun 2 11:34:03 es-data-03 es-production-cluster.log INFO [o.e.c.s.ClusterService] [es-data-03.cluster123.internal] removed {{es-master-02.cluster123.internal}{SSyCmD3mRvWcADDbf0co5w}{H8SLzxZXTlqqj1II8rK3Xg}{10.0.10.69}{10.0.10.69:9300},}, reason: master_failed ({es-master-02.cluster123.internal}{SSyCmD3mRvWcADDbf0co5w}{H8SLzxZXTlqqj1II8rK3Xg}{10.0.10.69}{10.0.10.69:9300})
Jun 2 11:34:04 es-master-02 es-production-cluster.log WARN [o.e.c.a.s.ShardStateAction] [es-master-02.cluster123.internal] [index20170525][4] received shard failed for shard id [[index20170525][4]], allocation id [qd6gLmuqThe8okSQyi11vw], primary term [2], message [mark copy as stale]

That is the whole cluster log. I don't see any actual reason for data node 03 to be down: CPU, memory & network all look fine. What baffles me even more is how it recovered automatically after 4 minutes without any message in the log.
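For reference, these are (as far as I know) the 5.x zen fault-detection defaults that govern how quickly a node gives up on the master and logs master_failed; I haven't changed them, so a ~30s network or GC hiccup could in theory trigger this:

```yaml
# elasticsearch.yml — zen discovery fault-detection settings (5.x defaults, listed for reference)
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3
```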

It would be great if someone could provide more insight into this.

Thanks in advance.


(Mark Walkom) #2

What version?


(Nor) #3

Ah sorry, I missed that one out. It's version 5.1.1.
I've added it to my original post.


(Mark Walkom) #4

Is there anything in the logs of the master node?


(Nor) #5

There's one entry in the master log. Note that the remote address in the exception, 10.0.10.156, is es-data-03:

Jun 2 11:34:02 es-master-02 es-production-cluster.log WARN [o.e.t.n.Netty4Transport] [es-master-02.cluster123.internal] exception caught on transport layer [[id: 0x0eb63c31, L:/10.0.10.69:51014 - R:10.0.10.156/10.0.10.156:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [38627469], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@1b234b31], error [false]; resetting
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1257) ~[elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[transport-netty4-5.1.1.jar:5.1.1]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) [netty-common-4.1.6.Final.jar:4.1.6.Final]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]

(Nor) #6

Is there anything you can deduce from the above error log?


(Mark Walkom) #7

Not really sorry.
There's nothing else in the logs? Has it happened again?


(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.