Nodes disconnect and rejoin the cluster on Elasticsearch 5.4.0

I have an Elasticsearch cluster running on AWS EC2 instances. Nodes frequently disconnect from the cluster and rejoin about 60 seconds later, which hurts the cluster's stability and health.
Error from the Elasticsearch logs:

[2019-09-23T02:39:52,249][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [mtr1] failed to execute [indices:monitor/stats] on node [1UNOXUgGRNC2HX6wbBFbng]
org.elasticsearch.transport.NodeNotConnectedException: [dat16][x.x.x.x:9300] Node not connected
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:621) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:115) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:513) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:476) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.sendNodeRequest(TransportBroadcastByNodeAction.java:322) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.start(TransportBroadcastByNodeAction.java:311) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:234) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:79) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:84) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1256) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.stats(AbstractClient.java:1577) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.updateIndicesStats(InternalClusterInfoService.java:270) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.refresh(InternalClusterInfoService.java:321) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.maybeRefresh(InternalClusterInfoService.java:277) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.access$500(InternalClusterInfoService.java:67) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.InternalClusterInfoService$SubmitReschedulingClusterInfoUpdatedJob.lambda$run$0(InternalClusterInfoService.java:224) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.0.jar:5.4.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]

Cluster config:
3 master nodes - c3.2xlarge
16 data nodes - i3.2xlarge
Number of shards - 48000
Heap allocation on data nodes = 32 GB
Delayed allocation timeout = 180m (to minimize shard reallocation to other nodes on disconnect)
Use case: product-related search

I would appreciate help identifying the cause of the issue. Let me know if additional information is needed.

--Syd

Before doing anything else, you should probably try to reduce the number of shards.
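To put the shard count in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes the 48000 shards include replicas and are spread evenly over the 16 data nodes, and it uses the commonly cited rule of thumb of keeping a node below about 20 shards per GB of heap. That rule of thumb and the helper names are my assumptions, not something taken from your cluster.

```python
# Figures taken from the cluster description in the question.
TOTAL_SHARDS = 48000
DATA_NODES = 16
HEAP_GB = 32

# Assumption: rule of thumb of at most ~20 shards per GB of heap.
SHARDS_PER_GB_HEAP = 20


def shards_per_node(total_shards: int, data_nodes: int) -> float:
    """Average number of shards each data node has to hold."""
    return total_shards / data_nodes


def rule_of_thumb_limit(heap_gb: float, shards_per_gb: int = SHARDS_PER_GB_HEAP) -> int:
    """Rough upper bound on shards per node for a given heap size."""
    return int(heap_gb * shards_per_gb)


actual = shards_per_node(TOTAL_SHARDS, DATA_NODES)
limit = rule_of_thumb_limit(HEAP_GB)
print(f"shards per data node: {actual:.0f}")
print(f"rule-of-thumb limit:  {limit}")
print(f"over the limit by a factor of {actual / limit:.1f}")
```

Under those assumptions each data node carries about 3000 shards against a rough ceiling of 640, so the cluster is several times over what the heap can comfortably manage.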

Also make sure that the JVM is still using compressed object pointers (compressed oops) with a heap of roughly 32 GB, since the JVM disables them once the heap gets close to that size. See https://www.elastic.co/guide/en/elasticsearch/reference/7.3/heap-size.html
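One way to check this is to look at the JVM section of the nodes info API (`GET _nodes/jvm`). The sketch below takes the parsed JSON of such a response and lists the nodes that do not report compressed oops. The field name `using_compressed_ordinary_object_pointers` is what I recall that API returning, so please verify it against your version; the sample response and the node names `dat01` and `dat02` are made up for illustration.

```python
def nodes_without_compressed_oops(nodes_info: dict) -> list:
    """Return names of nodes that do not confirm compressed oops.

    A missing or "unknown" flag is treated as not confirmed.
    """
    offenders = []
    for node in nodes_info.get("nodes", {}).values():
        flag = node.get("jvm", {}).get("using_compressed_ordinary_object_pointers")
        # The flag may arrive as a string ("true"/"false") or a boolean.
        if str(flag).lower() != "true":
            offenders.append(node.get("name", "<unnamed>"))
    return sorted(offenders)


# Hypothetical, heavily trimmed response used only to exercise the function.
sample = {
    "nodes": {
        "node-id-1": {
            "name": "dat01",
            "jvm": {"using_compressed_ordinary_object_pointers": "true"},
        },
        "node-id-2": {
            "name": "dat02",
            "jvm": {"using_compressed_ordinary_object_pointers": "false"},
        },
    }
}

print(nodes_without_compressed_oops(sample))
```

If any data node shows up in that list, lowering its heap a little below 32 GB should bring compressed oops back.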

The allocation timeout looks like it has been set because the cluster is so busy with this crazy high number of shards. Fixing the problem at the core should help a lot!

--Alex
