We have three Elasticsearch clusters, all of which are suffering from the same strange issue: the cluster fails after our application runs a query.
We have around 1.5 TB to 2.5 TB of data (including replicas) across the three clusters. The clusters are identical in structure, just on different hardware.
The data is stored in around 1500 indexes, and we had been running without issues for over a year. Recently we have needed to query more of the data, and we are now running complex queries across more indexes than before.
What happens is this: when the query is launched, the cluster distributes the search to all the indexes (well, shards), but after a short time the cluster nodes themselves start getting network timeouts, are removed from the cluster, and the cluster goes into a red state. It is bad enough that I have to fully restart the cluster to be able to use it again.
I believe the cause is that there is so much transport activity (visible in our network graphs) that the network card becomes saturated to the point where the cluster's state/liveness pings actually time out. We see this on both the masters and the data nodes: errors showing that pings timed out, followed not long after by the instance receiving the missing ping and logging that it arrived after it had expired.
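For what it's worth, these are the zen fault-detection settings involved (the values shown are what I understand the 2.x defaults to be; we have not changed them):

```yaml
# discovery.zen.fd.* controls the master/node liveness pings that are timing out
discovery.zen.fd.ping_interval: 1s    # how often each node is pinged
discovery.zen.fd.ping_timeout: 30s    # how long to wait for a ping reply
discovery.zen.fd.ping_retries: 3      # consecutive failures before the node is dropped
```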
To prove this was the issue, I added a second network card to every server in the cluster and used Linux bonding in round-robin mode to give each server 2 Gb of bandwidth rather than the 1 Gb used previously. This provides enough bandwidth for that particular test query to succeed and return data as expected, without any ping issues; however, a slightly more complex query then causes the same problem again.
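For reference, the bonding setup was roughly the following (a minimal sketch in Debian-style /etc/network/interfaces syntax; the interface names and address are placeholders for our actual layout):

```
# /etc/network/interfaces -- round-robin (mode 0) bond of two 1 Gb NICs
# eth0/eth1 and the address below are placeholders
auto bond0
iface bond0 inet static
    address 10.0.0.11
    netmask 255.255.255.0
    bond-slaves eth0 eth1
    bond-mode balance-rr     # round-robin: frames alternate across both NICs
    bond-miimon 100          # link-check interval in ms
```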
Normally, with clusters I have run in the past, I would fix this by routing the ping checks over their own LAN and the transport data over a separate LAN. That way the cluster can maintain its state checks even when the transport load is heavy.
Is there a way to do this with the current Elasticsearch transport module? From what I have read so far, apparently not, but if anyone has suggestions on how it can be done I would love to hear them.
I already segregate HTTP traffic and transport traffic for ES to avoid timeout issues like this, but I can't see how to split the pings from the standard transport data, if that is even possible.
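For context, the HTTP/transport split is done along these lines in elasticsearch.yml (the addresses are placeholders for our two LANs); as far as I can tell, the fault-detection pings ride on the same transport channel, which is why I can't separate them:

```yaml
# elasticsearch.yml -- bind HTTP and transport to different interfaces
# (addresses below are placeholders for our actual networks)
http.host: 10.0.1.11          # client-facing REST traffic
http.port: 9200
transport.host: 10.0.2.11     # inter-node traffic, including the zen fd pings
transport.tcp.port: 9300
```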
Forgot to mention: the instances in the cluster are on version 2.3.4. The issue also occurs on version 2.3.1.