Nodes frequently removed from cluster version 7.3.2

This issue has been plaguing us for quite sometime and various attempts at solving it haven't produced anything. Any help is greatly appreciated. We're seeing issues where the data nodes drop from the cluster. Sometimes they'll rejoin shortly after, other times they don't. There doesn't seem to be a any predictable pattern, sometimes they'll go days without any drops and other times there will be frequent drops throughout the day.

Stats:

  • Version 7.3.2
  • 3 Dedicated Master Nodes
  • Numerous dedicated coordinator nodes
  • Numerous dedicated data nodes
  • Data ingested from logstash
  • ~2k shards across the cluster, ranging from 50gb to 500mb.
  • Running on Docker
  • Data nodes have 28gb ram with another 70 left for the OS.

In the master logs I see the following:

[2020-06-13T05:12:30,585][INFO ][o.e.c.s.MasterService ] [masternode.name] node-left[{datanode.name}{IhwnvXp3TxKcqf8SjJ4KAA}{QWFJwJdFQy-vrW4GrQ3Dnw}{10.47.150.114}{10.47.150.114:9300}{d} lagging], term: 76, version: 340215, reason: removed {{datanode.name}{IhwnvXp3TxKcqf8SjJ4KAA}{QWFJwJdFQy-vrW4GrQ3Dnw}{10.47.150.114}{10.47.150.114:9300}{d},}

And in the data node logs I see:

[2020-06-13T05:08:23,637][INFO ][o.e.c.c.Coordinator      ] [datanode.name] master node [{masternode.name}{MtEn_IWjT7OQuntMn03QzQ}{XHo0RVYgRJ6dcqSG_aBjmA}{10.32.172.234}{10.32.172.234:9300}{m}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{masternode.name}{MtEn_IWjT7OQuntMn03QzQ}{XHo0RVYgRJ6dcqSG_aBjmA}{10.32.172.234}{10.32.172.234:9300}{m}] failed [3] consecutive checks
	at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:278) ~[elasticsearch-7.3.2.jar:7.3.2]
	at com.amazon.opendistroforelasticsearch.security.transport.OpenDistroSecurityInterceptor$RestoringTransportResponseHandler.handleException(OpenDistroSecurityInterceptor.java:277) ~[?:?]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:244) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:236) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:139) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1478) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [masternode.name][10.32.172.234:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: leader check from unknown node
	at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:181) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.coordination.LeaderChecker.lambda$new$0(LeaderChecker.java:105) ~[elasticsearch-7.3.2.jar:7.3.2]
	at com.amazon.opendistro.elasticsearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:43) ~[?:?]
	at com.amazon.opendistroforelasticsearch.security.ssl.transport.OpenDistroSecuritySSLRequestHandler.messageReceivedDecorate(OpenDistroSecuritySSLRequestHandler.java:164) ~[?:?]
	at com.amazon.opendistroforelasticsearch.security.transport.OpenDistroSecurityRequestHandler.messageReceivedDecorate(OpenDistroSecurityRequestHandler.java:292) ~[?:?]
	at com.amazon.opendistroforelasticsearch.security.ssl.transport.OpenDistroSecuritySSLRequestHandler.messageReceived(OpenDistroSecuritySSLRequestHandler.java:136) ~[?:?]
	at com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin$7$1.messageReceived(OpenDistroSecurityPlugin.java:635) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:267) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:188) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1478) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:834) ~[?:?]

I've added the following keepalive settings to the all hosts in the cluster:

net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_probes = 6
net.ipv4.tcp_keepalive_intvl = 10

Any idea what reason: removed means and why a node would be removed?

The reason for removal is actually the word lagging: this node did not apply cluster state updates within a (2-minute) timeout so was removed to protect the rest of the cluster.

Can you upgrade? More recent versions have more detailed logging in this area, and some better handling of nodes with slower disks which is the usual root cause of lagging nodes.

2 Likes

Thanks! Upgraded to 7.7.0 and haven't seen an issue since.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.