Random data node disconnections on AWS

We are running an Elasticsearch 5.0.2 cluster of 10 m4.2xlarge data nodes and 5 m4.xlarge master nodes, currently hosting around 21 TB of data. The ingest rate peaks at about 4000/second and the search rate is around 300/second. The indices are multi-tenant and monthly, with most queries going to the current month's indices. For discovery, we are using the discovery-ec2 plugin.

The cluster is stable most of the time, but we regularly see "master_left" exceptions from random data nodes. Here is an example message:

[p-elasticsearch-data-10] master_left [{p-elasticsearch-master-4}{3E2FXGGaQkCVva_P_y-KHw}{Xp9lS3j1TT-xKnGIiWQMzQ}{}{}{aws_availability_zone=us-east-1d, aws_availibility_zone=us-east-1d}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]

Looking at some previous discussions in the forum, I found suggestions to change the TCP keepalive sysctl variables, so I made these changes:

net.ipv4.tcp_keepalive_time: 600
net.ipv4.tcp_keepalive_intvl: 60
net.ipv4.tcp_keepalive_probes: 3
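
To confirm that these values are actually live on each node, and to see how long the kernel would take to drop a dead connection with them, something like the following sketch could be used. It assumes the standard Linux /proc/sys/net/ipv4 layout. The function names are made up for illustration and are not part of any Elasticsearch tooling.

```python
from pathlib import Path

# The three kernel keepalive settings listed above.
KEEPALIVE_KEYS = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def read_keepalive(base="/proc/sys/net/ipv4"):
    """Return the live keepalive settings as a dict of ints.

    `base` is a parameter only so the function can be pointed at a
    different directory for testing (an assumption of this sketch).
    """
    return {key: int((Path(base) / key).read_text().strip()) for key in KEEPALIVE_KEYS}


def dead_peer_detection_seconds(settings):
    """Idle time before the first probe, plus the time for all probes to go unanswered."""
    return (
        settings["tcp_keepalive_time"]
        + settings["tcp_keepalive_intvl"] * settings["tcp_keepalive_probes"]
    )


if __name__ == "__main__":
    live = read_keepalive()
    print(live)
    print("seconds until a dead peer is dropped:", dead_peer_detection_seconds(live))
```

With the values above, an idle connection to an unreachable peer would be dropped after 600 + 60 * 3 = 780 seconds.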

However, the problem persists. Each disconnection triggers a massive shard reallocation at random hours of the day, which slows our queries down. Are there any other configuration properties I am missing?
