Hi,
I have a cluster with 3x client, 3x master & 3x data nodes.
It had been running for weeks without any issues, but today I suddenly started seeing errors.
ES Version: 5.1.1
Symptom
My cluster and all indices turned yellow.
All nodes are green.
70 unassigned shards out of 224 total.
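For reference, the yellow status and the unassigned-shard count above can be confirmed with the standard cluster APIs; assuming the default HTTP port 9200 on any node, something like:

```shell
# Overall cluster status, including the unassigned-shard count
curl -s 'http://localhost:9200/_cluster/health?pretty'

# List only the unassigned shards along with the reason code
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' \
  | grep UNASSIGNED
```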
Hypothesis
Data node 03 was unable to talk to the master, so the master decided to reallocate its shards to the available data nodes.
Incident time: 11:34am. Recovery time: 11:38am.
Jun 2 11:34:03 es-master-02 es-production-cluster.log WARN [o.e.c.a.s.ShardStateAction] [es-master-02.cluster123.internal] [index20170525][2] received shard failed for shard id [[index20170525][2]], allocation id [xLjSTaN3T5evzL5091IaQg], primary term [1], message [mark copy as stale]
Jun 2 11:34:03 es-data-03 es-production-cluster.log nt nodes: nodes:
{es-client-03.cluster123.internal}{xYH-LxRhTfa2-ivGF8xYtg}{3n6ZS3rYQzet6CylzWuI_A}{10.0.10.124}{10.0.10.124:9300}
{es-master-03.cluster123.internal}{wUy1UMaURc-s4So_hngIBw}{MjIvlZXQSfuNtmmkY7pVHw}{10.0.10.41}{10.0.10.41:9300}
{es-data-02.cluster123.internal}{LD551HnKSPOVLD-rda9X0A}{oPMl6McNTp6zKefK6mdMOA}{10.0.10.58}{10.0.10.58:9300}
{es-master-01.cluster123.internal}{yt5mqQadRtyvx3w2Kc4KSA}{aB9ge6UfQnuIo_TmyZaRtg}{10.0.10.96}{10.0.10.96:9300}
{es-data-03.cluster123.internal}{tlaG3OlNTKquzNT0IJkLWA}{xKi2IznpToqC5j436FYUuw}{10.0.10.156}{10.0.10.156:9300}, local
{es-client-02.cluster123.internal}{D34sEfjyQzy_4iHere-hTw}{YPdWK9urR-6u1FsWmACTfQ}{10.0.10.108}{10.0.10.108:9300}
{es-client-01.cluster123.internal}{dWiuPoPnQgakiNTT1YDC0A}{LBUZELNITf-YFOmFGVACLA}{10.0.10.14}{10.0.10.14:9300}
{es-data-01.cluster123.internal}{0rPrYbWZTjauAulRYuG7IQ}{iF8HhNCUQcuanAZ1gWRFhA}{10.0.10.166}{10.0.10.166:9300}
Jun 2 11:34:03 es-data-03 es-production-cluster.log
Jun 2 11:34:03 es-data-03 es-production-cluster.log INFO [o.e.c.s.ClusterService] [es-data-03.cluster123.internal] removed {{es-master-02.cluster123.internal}{SSyCmD3mRvWcADDbf0co5w}{H8SLzxZXTlqqj1II8rK3Xg}{10.0.10.69}{10.0.10.69:9300},}, reason: master_failed ({es-master-02.cluster123.internal}{SSyCmD3mRvWcADDbf0co5w}{H8SLzxZXTlqqj1II8rK3Xg}{10.0.10.69}{10.0.10.69:9300})
Jun 2 11:34:04 es-master-02 es-production-cluster.log WARN [o.e.c.a.s.ShardStateAction] [es-master-02.cluster123.internal] [index20170525][4] received shard failed for shard id [[index20170525][4]], allocation id [qd6gLmuqThe8okSQyi11vw], primary term [2], message [mark copy as stale]
That is the whole cluster log. I don't see any actual reason for data node 03 to be down; CPU, memory & network all look fine. What baffles me even more is how it recovered automatically after 4 minutes without any message in the log.
Would be great if someone could provide more insight into this. The only other thing I found is this transport exception on the master, right before the incident:
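In case it helps anyone reproduce the diagnosis: if I understand correctly, 5.1 already ships the allocation-explain API, which asks the master why a given shard copy is unassigned. A sketch, using one of the shards from my logs as an example (in 5.x the request body can be sent without an explicit Content-Type header):

```shell
# Ask the master why this copy of index20170525 shard 2 is unassigned
curl -s -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -d '{"index": "index20170525", "shard": 2, "primary": false}'
```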
Jun 2 11:34:02 es-master-02 es-production-cluster.log WARN [o.e.t.n.Netty4Transport] [es-master-02.cluster123.internal] exception caught on transport layer [[id: 0x0eb63c31, L:/10.0.10.69:51014 - R:10.0.10.156/10.0.10.156:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [38627469], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@1b234b31], error [false]; resetting
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1257) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[transport-netty4-5.1.1.jar:5.1.1]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248) [netty-codec-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) [netty-common-4.1.6.Final.jar:4.1.6.Final]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]