Unstable Elasticsearch cluster

hey guys

Sorry in advance for the long post. We are new to the Elasticsearch world and are not entirely sure where the issue lies. It could be with the AWS networking interface itself, or with connectivity inside the ES cluster. We started seeing it recently, after we spun up a new stage cluster in the same VPC that was meant to look exactly the same, just with a different cluster name. So these issues could also have something to do with that, or not.

Our cluster blueprint is as follows: an AWS EC2 based cluster with 9 nodes: 3 client, 3 master and 3 data nodes.

elasticsearch.yml configuration looks like this:


Now to the errors. As well as giving you a curtailed version of some elasticsearch.log files, I would also note that we have had 4 nodes (EC2 machines) fail the instance status check, which we stopped and started to restore functionality.

So the cluster seems to be failing intermittently (every couple of minutes or so): it goes to red, then yellow, then back to green. The time this takes varies, and it only holds the green state for a short while (3-5 mins). If I were to check the health of the cluster at any given time, there would be a good 80% chance that I would see either yellow or red.

As far as I can tell from wading through many logs, this is my best guess at what is happening. We start out with a green cluster, followed soon by what looks like network connectivity issues:
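To put a number on that 80% instead of eyeballing it, something like the rough sketch below could sample the cluster health endpoint on a schedule. It assumes a client node reachable at http://localhost:9200 (a placeholder, not our actual address), and the function names are made up for illustration.

```python
import json
import time
import urllib.request
from itertools import groupby


def fetch_status(base_url="http://localhost:9200", timeout=5):
    """Return the 'status' field of GET /_cluster/health, or 'unreachable'."""
    try:
        with urllib.request.urlopen(base_url + "/_cluster/health", timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))["status"]
    except (OSError, ValueError, KeyError):
        # Connection errors, timeouts, bad JSON or a missing field all count as unreachable.
        return "unreachable"


def poll(base_url="http://localhost:9200", interval=30, count=20):
    """Collect `count` (timestamp, status) samples, `interval` seconds apart."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), fetch_status(base_url)))
        time.sleep(interval)
    return samples


def summarize(samples):
    """Collapse samples into runs of equal status and compute the non-green share.

    Returns (runs, fraction), where each run is (status, first_ts, last_ts).
    """
    runs = []
    for status, group in groupby(samples, key=lambda s: s[1]):
        group = list(group)
        runs.append((status, group[0][0], group[-1][0]))
    non_green = sum(1 for _, status in samples if status != "green")
    fraction = non_green / len(samples) if samples else 0.0
    return runs, fraction
```

With the defaults, `summarize(poll())` samples for about ten minutes and returns the red/yellow/green runs plus the fraction of samples that were not green, which would be easy to compare against the timestamps in elasticsearch.log.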


Note that the time between the master adding es-elk-esdata-01 and the failed to execute error is about 4 minutes. During this time, the cluster is green.

So we see tens of these failed to execute errors, for all esdata nodes (01, 02, 03).

Then we see these NodeDisconnected errors for every index:

[2015-11-12 11:16:12,529][DEBUG][action.admin.indices.stats] [es-elk-esmaster-01] [logstash-2015.08.06][2], node[Nv9dvRY4SBOiYNan_Ja76g], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@6fc0d1b6]

Then we see this:

I realize the timestamps are almost an hour off between the NodeDisconnected and the gateway.local errors. I may have copy-pasted from the wrong parts of the log file (i.e., a different failure 'episode').

We are running Elasticsearch 1.4.4.
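To avoid mixing up episodes like that, a small script could bucket the errors by minute so that each failure episode shows up as its own cluster of timestamps. This is only a sketch: the regex is inferred from the single log line quoted above and may need adjusting for other line shapes, and the function names are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Inferred from lines shaped like:
# [2015-11-12 11:16:12,529][DEBUG][action.admin.indices.stats] [es-elk-esmaster-01] message...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]"
    r"\[(?P<level>[A-Z]+)\s*\]"
    r"\[(?P<component>[^\]]+?)\s*\]\s+"
    r"\[(?P<node>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict with ts, level, component, node, message, or None if no match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None  # e.g. stack trace continuation lines
    fields = match.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return fields


def failures_per_minute(lines, needle="failed to execute"):
    """Count log lines containing `needle`, keyed by 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields and needle in fields["message"]:
            counts[fields["ts"].strftime("%Y-%m-%d %H:%M")] += 1
    return counts
```

Running `failures_per_minute(open("elasticsearch.log"))` on each node's log and comparing the busiest minutes should show which NodeDisconnected and gateway.local errors belong to the same episode.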