I'm really stumped here, so I'm hoping someone can point me in the right direction. Thanks!
Over the past couple of weeks I've noticed that nodes will occasionally leave our cluster and then rejoin a short time later, and I don't understand why. By "occasionally" I really do mean occasionally: it didn't happen at all for the past week, and then just today 2 data nodes left, the cluster recovered, then 4 data nodes left, the cluster recovered, and then one of the master-eligible nodes left. Traffic today didn't look abnormal.
Cluster stats
- The cluster runs on AWS
- ES version 7.3.2
- 3 dedicated masters (r4.2xlarge)
- 3 dedicated coordinating (r4.2xlarge)
- 5 dedicated ingest (r4.2xlarge)
- 20 dedicated data nodes (i3.2xlarge)
- 7,166 total shards
- about 6.5 TB of storage
- 30 GB of heap space
The configuration on one of the data nodes that left (the configuration on every other node is essentially the same apart from the node roles):
discovery:
  seed_providers: ec2
  ec2.tag.elastic_role: master
  ec2.tag.cluster_name: elasticsearch.platform.rate.com
cluster:
  initial_master_nodes: ""
  name: elasticsearch.platform.rate.com
node:
  master: false
  data: true
  ingest: false
  ml: false
  name: ""
  attr.box_type: warm
network.host: 0.0.0.0
xpack:
  ml.enabled: false
  monitoring:
    enabled: true
    elasticsearch.collection.enabled: true
    collection.enabled: true
The log I'm seeing on the data node before it leaves:
{"type": "server", "timestamp": "2020-03-24T21:52:22,579+0000", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "elasticsearch.platform.rate.com", "node.name": "ip-10-101-20-143", "cl
uster.uuid": "0mOjFZ1wTy6OMWkB6mKyNw", "node.id": "z65w-XMGRcac9SrR57KyWw", "message": "master node [{master-1}{uUNpXOzCRt6IZAE8VWAdeA}{W5hOAQ5qR9G6tAqxep7gfg}{10.101.23.77}{10.101.23.77:9300}{m}{xpack.i
nstalled=true, box_type=master}] failed, restarting discovery" ,
"stacktrace": ["org.elasticsearch.ElasticsearchException: node [{master-1}{uUNpXOzCRt6IZAE8VWAdeA}{W5hOAQ5qR9G6tAqxep7gfg}{10.101.23.77}{10.101.23.77:9300}{m}{xpack.installed=true, box_type=master}] faile
d [3] consecutive checks",
"at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:278) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:244) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:236) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:139) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]",
"at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) ~[?:?]",
"at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]",
"at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]",
"Caused by: org.elasticsearch.transport.RemoteTransportException: [master-1][10.101.23.77:9300][internal:coordination/fault_detection/leader_check]",
"Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: leader check from unknown node",
This seems to be the reason the node left:
failed, restarting discovery",
"stacktrace": ["org.elasticsearch.ElasticsearchException: node [{master-1}{uUNpXOzCRt6IZAE8VWAdeA}{W5hOAQ5qR9G6tAqxep7gfg}{10.101.23.77}{10.101.23.77:9300}{m}{xpack.installed=true, box_type=master}]
failed [3] consecutive checks",
But I'm stumped as to why the master failed these checks. The master doesn't seem overloaded when I look at CPU, JVM heap usage, GC activity, and so on.
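For what it's worth, I've been considering relaxing the leader check settings on the nodes as a stopgap while I track down the root cause, something along these lines in elasticsearch.yml (the values are just my guesses, and I realize this would only mask whatever is making the checks fail in the first place):

cluster:
  fault_detection:
    leader_check:
      # default timeout is 10s per check; allow more slack for transient network blips
      timeout: 30s
      # default is 3 consecutive failures before the node gives up on the master
      retry_count: 5

Is that a reasonable knob to touch here, or am I looking at the wrong thing entirely?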
I don't see anything noteworthy in the logs on the elected master with regard to nodes leaving. I do see a fair number of errors about indexing the monitoring data. Perhaps I should turn monitoring collection off?
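If turning it off is a sane thing to do, I assume it would just be a matter of setting something like this on the nodes (or toggling xpack.monitoring.collection.enabled dynamically through the cluster settings API), but please correct me if I have the wrong setting:

xpack:
  monitoring:
    # stop collecting monitoring data for this cluster; the monitoring UI would stop updating
    collection.enabled: false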
Thanks again for any and all help!