We have a few high-traffic ES clusters running in a production environment.
Our setup varies slightly from location to location but is generally as follows:
3 master nodes with 4 GB RAM
4 client nodes with 64 GB RAM, behind a round-robin load balancer
18 data nodes with 8 x SSDs and 256 GB RAM
We are running Elasticsearch 2.4.1. All nodes have a 30.5 GB heap.
In the last couple of months we've had a few incidents where a yet-to-be-identified surge in heap usage on a single node caused it to become unresponsive.
Our shard distribution is such that most queries end up hitting every data node, which means that whenever one node is unresponsive, the whole cluster becomes unresponsive as well. So what we have on our hands is a nicely distributed single point of failure.
This raises the question: with a replication factor of 2 on all indices, couldn't Elasticsearch be smarter about this and query all three copies of each shard at the same time, using whichever one responds first?
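To make the idea concrete, here is a minimal client-side sketch of that "hedged request" pattern: fire the same query at every node holding a copy of the shard and keep whichever answer arrives first. The node names, latencies, and `query_node` function are illustrative stand-ins, not real Elasticsearch client calls.

```python
# Client-side hedged requests: send the same query to all replica-holding
# nodes and return the first response that completes.
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def query_node(node, delay, payload):
    """Stand-in for an HTTP search request to one Elasticsearch node."""
    time.sleep(delay)  # simulate network + search latency
    return (node, payload)

def hedged_search(nodes, payload):
    """Query every node concurrently; keep the first completed result."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(query_node, n, d, payload) for n, d in nodes]
        for future in as_completed(futures):
            return future.result()  # fastest copy wins; the rest are ignored

# Three copies of the shard (primary + 2 replicas), with made-up latencies.
nodes = [("data-01", 0.30), ("data-02", 0.05), ("data-03", 0.20)]
winner, result = hedged_search(nodes, {"query": {"match_all": {}}})
print(winner)  # data-02, the fastest copy
```

The obvious trade-off is that this triples the query load on the cluster, so in practice you'd probably only hedge after a timeout rather than up front.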
During normal operation our used heap hovers around 20 GB, so the committed 30.5 GB should suffice. Yes, we need to figure out the issue on our end, but it would be nice if the cluster could do a better job of safeguarding us from our own misconfigurations.
No Marvel in production. We're using it in our test environment but haven't gotten the go-ahead to pay the $$$$ needed to use it in production. Note that my main motivation for making this post is the distributed-single-point-of-failure part.
As our cluster grows, we sort of always expect heap usage to sneak up on us. It would be nice if that didn't have such a catastrophic impact. We actively monitor GC logs to catch early signs of heap pressure, but these sudden surges aren't caught that way.
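For reference, the kind of check we run alongside the GC logs looks roughly like this: poll `GET /_nodes/stats/jvm` and flag any node whose `heap_used_percent` crosses a threshold. The sample payload below is a trimmed-down, made-up response, and the 75% threshold is just an example.

```python
# Flag nodes whose JVM heap usage exceeds a threshold, given a parsed
# response from Elasticsearch's _nodes/stats/jvm API.

def nodes_over_heap_threshold(stats, threshold_pct=75):
    """Return (name, heap_used_percent) for nodes above the threshold."""
    hot = []
    for node in stats["nodes"].values():
        pct = node["jvm"]["mem"]["heap_used_percent"]
        if pct >= threshold_pct:
            hot.append((node["name"], pct))
    return hot

# Trimmed, illustrative stats payload (two data nodes).
sample = {
    "nodes": {
        "abc": {"name": "data-01", "jvm": {"mem": {"heap_used_percent": 66}}},
        "def": {"name": "data-02", "jvm": {"mem": {"heap_used_percent": 91}}},
    }
}
print(nodes_over_heap_threshold(sample))  # [('data-02', 91)]
```

This catches gradual heap growth fine; the problem described above is that the surges happen faster than any reasonable polling interval.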
Was not aware of that. I thought it was $500 per server per year or something.
Regardless, care to comment on the main underlying point about the distributed single point of failure?
We aim to achieve HA, but regardless of how stable our own setup or Elasticsearch itself is, a situation where one node becomes slow or temporarily unreachable is inevitable.