I have an Elasticsearch cluster running on AWS EC2 instances. Nodes frequently disconnect from the cluster and rejoin about 60 seconds later, which hurts the cluster's stability and health.
Error from the Elasticsearch logs:
[2019-09-23T02:39:52,249][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [mtr1] failed to execute [indices:monitor/stats] on node [1UNOXUgGRNC2HX6wbBFbng]
org.elasticsearch.transport.NodeNotConnectedException: [dat16][x.x.x.x:9300] Node not connected
at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:621) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:115) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:513) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:476) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.sendNodeRequest(TransportBroadcastByNodeAction.java:322) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.start(TransportBroadcastByNodeAction.java:311) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:234) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:79) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:84) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1256) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.stats(AbstractClient.java:1577) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.InternalClusterInfoService.updateIndicesStats(InternalClusterInfoService.java:270) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.InternalClusterInfoService.refresh(InternalClusterInfoService.java:321) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.InternalClusterInfoService.maybeRefresh(InternalClusterInfoService.java:277) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.InternalClusterInfoService.access$500(InternalClusterInfoService.java:67) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.InternalClusterInfoService$SubmitReschedulingClusterInfoUpdatedJob.lambda$run$0(InternalClusterInfoService.java:224) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]
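To see whether the disconnects cluster around particular nodes, the exception lines can be tallied per node name. The following is only a minimal sketch: the function name count_disconnects and the regular expression are illustrative assumptions based on the format of the line shown above, and the sample input is that same line rather than a full log file.

```python
import re
from collections import Counter

# Assumed line format, taken from the exception line quoted above:
# ...NodeNotConnectedException: [dat16][x.x.x.x:9300] Node not connected
NOT_CONNECTED = re.compile(r"NodeNotConnectedException: \[([^\]]+)\]\[([^\]]+)\]")


def count_disconnects(lines):
    """Count NodeNotConnectedException lines per node name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = NOT_CONNECTED.search(line)
        if match:
            # group(1) is the node name, group(2) is the transport address
            counts[match.group(1)] += 1
    return counts


sample = [
    "org.elasticsearch.transport.NodeNotConnectedException: [dat16][x.x.x.x:9300] Node not connected",
    "at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:621) ~[elasticsearch-5.4.0.jar:5.4.0]",
]
print(count_disconnects(sample))
```

In practice the lines would come from iterating over an open log file instead of the sample list.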
Cluster config:
3 master nodes - c3.2xlarge
16 data nodes - i3.2xlarge
Number of shards - 48000
Heap allocation on data nodes = 32GB
Delayed allocation timeout = 180m (to minimize shard reallocation to other nodes on disconnect)
The cluster is used for product-related search.
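For reference, the delayed allocation timeout listed above corresponds to the index setting index.unassigned.node_left.delayed_timeout. The sketch below shows one way such a value could be applied through the REST API. The host http://localhost:9200, the _all index target, and the helper name build_delayed_timeout_request are assumptions for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_delayed_timeout_request(base_url, timeout="180m", index="_all"):
    """Build (but do not send) a PUT request that sets the delayed allocation timeout.

    base_url and index are placeholders; adjust them to the real cluster.
    """
    body = json.dumps(
        {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host; nothing is sent over the network in this sketch.
req = build_delayed_timeout_request("http://localhost:9200")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would actually apply the setting, so that step is left out on purpose.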
I would appreciate help identifying the cause of the issue. Let me know if any additional information is needed.
--Syd