We have a few high-traffic ES clusters running in a production environment.
Our setup varies slightly from location to location but is generally as follows:
3 master nodes w/ 4 GB ram
4 client nodes behind a round robin load balancer with 64GB ram
18 data nodes w/ 8 x SSDs with 256GB ram
We are running Elasticsearch 2.4.1. All nodes have 30.5GB heap.
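For context, on 2.x we set the heap via the `ES_HEAP_SIZE` environment variable, staying just under 32 GB so the JVM keeps using compressed object pointers (the exact file below depends on your install; ours is the Debian package):

```shell
# /etc/default/elasticsearch (path varies by install method)
# Staying below ~32 GB keeps compressed oops enabled; above that
# threshold the JVM falls back to 64-bit pointers and you lose
# more effective heap than you gain.
ES_HEAP_SIZE=30500m
```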
In the last couple of months we've had a few issues where a yet-to-be-identified surge in heap usage on a single node causes that node to become unresponsive.
Our shard distribution is such that most queries end up hitting every data node, which means that whenever one node is unresponsive, the whole cluster effectively becomes unresponsive as well. So what we have on our hands is a nice distributed single point of failure.
This raises the question: with a replica factor of 2 for all indices, would it not be possible to be smarter about this -- have ES query all three copies of a shard at the same time and use whichever responds first?
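As far as I can tell 2.4 doesn't do anything like this out of the box (some form of adaptive replica selection only arrived in much later versions), but the idea is easy to sketch client-side as a "hedged request": fire the same query at every copy and take the first answer. The node names, delays, and `query_replica` stub below are made up for illustration; in real code that function would be an actual search call against one copy:

```python
import concurrent.futures
import time

def query_replica(node, delay, payload):
    # Stand-in for a real search request against one copy of the shard;
    # 'delay' simulates that node's current response time.
    time.sleep(delay)
    return (node, payload)

def hedged_query(replicas, payload):
    """Send the same query to every replica and return the first answer."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(query_replica, node, delay, payload)
                   for node, delay in replicas]
        # as_completed yields futures in the order they finish,
        # so the first one yielded is the fastest replica.
        done = next(concurrent.futures.as_completed(futures))
        winner, result = done.result()
        for f in futures:
            f.cancel()  # best effort; already-running calls keep going
        return winner, result

# Hypothetical nodes: data-07 happens to be fastest right now.
replicas = [("data-01", 0.30), ("data-07", 0.05), ("data-12", 0.15)]
winner, result = hedged_query(replicas, {"query": {"match_all": {}}})
print(winner)
```

The obvious downside is that every query now costs three times the work cluster-wide, which is why I'd rather see ES do something smarter internally than run this from our clients.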
During normal operation, our used heap hovers around 20GB, so the committed 30.5GB should suffice. Yes, we need to figure out the underlying issue on our end, but it would be nice if the cluster could do a better job of safeguarding us from our own misconfigurations.
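In the meantime we're considering tightening the circuit breakers so a single runaway request trips before it takes the heap down. A sketch of what that might look like in `elasticsearch.yml` on 2.4 -- the percentages are guesses for our workload, not recommendations:

```yaml
# Circuit breaker limits (illustrative values, tuned lower than the
# 2.x defaults of 70% / 60% / 40%; adjust against your own heap profile).
indices.breaker.total.limit: 60%       # cap on all breakers combined
indices.breaker.fielddata.limit: 40%   # fielddata is a common heap-surge culprit
indices.breaker.request.limit: 30%     # per-request structures, e.g. aggregations
```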