Nodes disconnect and rejoin the cluster on Elasticsearch 5.4.0

syd05 · September 23, 2019, 11:03am

I have an Elasticsearch cluster running on AWS EC2 instances. Nodes frequently disconnect from the cluster and rejoin about 60 seconds later, which affects the cluster's stability and health.
Error from the Elasticsearch logs:

[2019-09-23T02:39:52,249][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [mtr1] failed to execute [indices:monitor/stats] on node [1UNOXUgGRNC2HX6wbBFbng]
org.elasticsearch.transport.NodeNotConnectedException: [dat16][x.x.x.x:9300] Node not connected
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:621) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:115) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:513) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:476) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.sendNodeRequest(TransportBroadcastByNodeAction.java:322) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.start(TransportBroadcastByNodeAction.java:311) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:234) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:79) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:84) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1256) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.stats(AbstractClient.java:1577) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.updateIndicesStats(InternalClusterInfoService.java:270) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.refresh(InternalClusterInfoService.java:321) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.maybeRefresh(InternalClusterInfoService.java:277) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.access$500(InternalClusterInfoService.java:67) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.cluster.InternalClusterInfoService$SubmitReschedulingClusterInfoUpdatedJob.lambda$run$0(InternalClusterInfoService.java:224) ~[elasticsearch-5.4.0.jar:5.4.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.0.jar:5.4.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]

Cluster config:
3 master nodes - c3.2xlarge
16 dat nodes - i3.2xlarge
Number of shards - 48000
Heap allocation on dat nodes = 32GB
Delayed allocation timeout = 180m (to minimize shard reallocation to other nodes on disconnect)
The cluster is used for product-related search

I would appreciate help identifying the cause of the issue. Let me know if additional information is needed.

--Syd

spinscale · September 24, 2019, 8:56am

Before doing anything else, you should probably try to reduce the number of shards.

Also make sure that you are using compressed pointers when allocating roughly 32 GB of heap. See https://www.elastic.co/guide/en/elasticsearch/reference/7.3/heap-size.html
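
One way to check this across the cluster is to look at the JVM section of the nodes info API (GET _nodes/jvm). The following is only a sketch: it assumes each node entry carries a `using_compressed_ordinary_object_pointers` field reported as a string, and the `sample` response is made up for illustration, not taken from the cluster in this thread.

```python
def nodes_without_compressed_oops(nodes_info):
    """Return the names of nodes whose JVM does not report compressed oops.

    `nodes_info` is the parsed JSON body of GET _nodes/jvm (assumed shape).
    """
    flagged = []
    for node in nodes_info.get("nodes", {}).values():
        flag = node.get("jvm", {}).get("using_compressed_ordinary_object_pointers")
        # Assumption: the flag is reported as the string "true" or "false".
        # Anything other than "true" (including a missing field) is flagged.
        if str(flag).lower() != "true":
            flagged.append(node.get("name", "unknown"))
    return sorted(flagged)


# Hypothetical response fragment, for illustration only.
sample = {
    "nodes": {
        "id-a": {"name": "dat01", "jvm": {"using_compressed_ordinary_object_pointers": "true"}},
        "id-b": {"name": "dat02", "jvm": {"using_compressed_ordinary_object_pointers": "false"}},
    }
}

print(nodes_without_compressed_oops(sample))
```

Any node that shows up in the result is worth a closer look; lowering its heap slightly below 32 GB is the usual remedy described on the linked page.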

The allocation timeout looks like it has been set because the cluster is so busy with this crazy high number of shards. Fixing the problem at its core should help a lot!
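
To put the shard count in perspective, here is a rough back-of-the-envelope calculation using the numbers from the original post. It assumes all 48000 shards sit on the 16 data nodes and are spread evenly, which may not match the real distribution.

```python
def shards_per_node(total_shards, data_nodes):
    """Average number of shards per data node, assuming an even spread."""
    return total_shards / data_nodes


def shards_per_gb_heap(total_shards, data_nodes, heap_gb):
    """Average number of shards per GB of heap on each data node."""
    return shards_per_node(total_shards, data_nodes) / heap_gb


# Numbers from the original post: 48000 shards, 16 data nodes, 32 GB heap each.
per_node = shards_per_node(48000, 16)
per_gb = shards_per_gb_heap(48000, 16, 32)

print(f"{per_node:.0f} shards per data node")
print(f"{per_gb:.2f} shards per GB of heap")
```

That works out to about 3000 shards per data node, or roughly 94 per GB of heap. Elastic's published sizing guidance has suggested a rule of thumb of staying below about 20 shards per GB of heap, so this cluster is several times over that.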

--Alex

system · October 22, 2019, 8:56am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
