Random data node disconnections on AWS

anishm · February 14, 2017, 12:38pm

We are running an Elasticsearch 5.0.2 cluster of 10 m4.2xlarge data nodes and 5 m4.xlarge master nodes, currently hosting around 21TB of data. The regular ingest rate goes up to 4000/second and the search rate is around 300/second. The indices are multi-tenant and monthly, with most queries going to the current month's indices. For discovery, we are using the discovery-ec2 plugin.

The cluster is stable most of the time, but we regularly see "master_left" exceptions coming from random data nodes. Here is an example message:

[p-elasticsearch-data-10] master_left [{p-elasticsearch-master-4}{3E2FXGGaQkCVva_P_y-KHw}{Xp9lS3j1TT-xKnGIiWQMzQ}{10.0.19.220}{10.0.19.220:9300}{aws_availability_zone=us-east-1d, aws_availibility_zone=us-east-1d}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
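To see whether the disconnects really are spread across random data nodes, or whether they cluster around one master, it may help to tally the messages. Below is a minimal sketch, assuming every log line ends with a message shaped like the one above. The function names `parse_master_left` and `tally` are made up for illustration, and the regular expression is only checked against this single sample.

```python
import re
from collections import Counter

# Matches "[<node>] master_left [{<master>}...], reason [<reason>]".
# Assumption: the line ends with the closing bracket of the reason.
MASTER_LEFT = re.compile(
    r"\[(?P<node>[^\]]+)\] master_left "
    r"\[\{(?P<master>[^}]+)\}.*?\], reason \[(?P<reason>.*)\]\s*$"
)

def parse_master_left(line):
    """Return a dict with node, master and reason, or None if the line does not match."""
    m = MASTER_LEFT.search(line)
    return m.groupdict() if m else None

def tally(lines):
    """Count master_left events per (data node, master) pair."""
    counts = Counter()
    for line in lines:
        rec = parse_master_left(line)
        if rec:
            counts[(rec["node"], rec["master"])] += 1
    return counts
```

Feeding the lines of each data node's log file to `tally` gives a count per node and master pair, which shows whether one master or one availability zone dominates.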

Looking at some previous discussions in the forum, I found suggestions to change some sysctl variables, so I made these changes:

net.ipv4.tcp_keepalive_time: 600
net.ipv4.tcp_keepalive_intvl: 60
net.ipv4.tcp_keepalive_probes: 3
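To confirm that these values are actually in effect on each node, the kernel's live settings can be compared with the intended ones. The following is a rough sketch for Linux hosts only. It assumes the standard `/proc/sys/net/ipv4` location, and the helper names are illustrative. The last function computes how long an idle connection to a silent peer survives before the kernel drops it: the keepalive time plus the interval multiplied by the number of probes.

```python
from pathlib import Path

# Target values quoted above; adjust if your targets differ.
EXPECTED = {
    "tcp_keepalive_time": 600,
    "tcp_keepalive_intvl": 60,
    "tcp_keepalive_probes": 3,
}

def read_keepalive(base="/proc/sys/net/ipv4"):
    """Read the live keepalive settings as integers (Linux only)."""
    return {name: int((Path(base) / name).read_text().strip()) for name in EXPECTED}

def mismatches(actual, expected=EXPECTED):
    """Return {name: (actual, expected)} for every setting that differs."""
    return {
        name: (actual.get(name), want)
        for name, want in expected.items()
        if actual.get(name) != want
    }

def dead_peer_detection_seconds(settings):
    """Idle seconds before the kernel gives up on a silent peer."""
    return (
        settings["tcp_keepalive_time"]
        + settings["tcp_keepalive_intvl"] * settings["tcp_keepalive_probes"]
    )
```

On a node, an empty result from `mismatches(read_keepalive())` means the values were applied. With the settings above, `dead_peer_detection_seconds(EXPECTED)` is 600 + 60 * 3 = 780 seconds.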

However, the problem persists. It causes massive shard reallocation at random hours of the day, which slows our queries down. Are there any more configuration properties I am missing?

