We are occasionally encountering ES cluster instability due to non-optimized queries. Our developers are working on optimizing them. When the poorly performing queries are sent to ES, GC time spikes and CPU utilization rises to 90%. This causes the data nodes to stop responding to the master's ping requests, so a few data nodes leave the cluster and rejoin it later.
It takes almost 45 minutes to 1 hour for the cluster to become 'GREEN' again because of replica shard reallocation after a data node leaves the cluster.
Currently 'discovery.zen.ping.timeout' is set to the default (30 seconds). Also, 'index.unassigned.node_left.delayed_timeout' is set to the default (1 minute).
My question is: until we optimize all the queries, which of the above settings should I tweak to make sure the cluster stays stable, or at least turns 'GREEN' sooner?
The first thing you can do is upgrade. There are a number of improvements around circuit breakers that will prevent these bad queries from even running.
The second and third things you can do are also to upgrade.
Are there any other settings for the circuit breaker besides the above?
Also, will it help to increase 'discovery.zen.ping.timeout' from 30 seconds to 1 or 2 minutes? Changing this setting will require an ES service restart, correct?
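On the delayed allocation side: 'index.unassigned.node_left.delayed_timeout' is a dynamic index setting, so it can be changed through the index settings API without restarting anything. Raising it gives a node that drops out more time to rejoin before its replicas are reallocated elsewhere. A minimal sketch of that call in Python follows. The node address `http://localhost:9200`, the `10m` value, and the helper name `build_delayed_timeout_request` are illustrative assumptions, not recommendations from this thread.

```python
import json
import urllib.request


def build_delayed_timeout_request(base_url, timeout, index="_all"):
    """Build (but do not send) a PUT request that updates
    index.unassigned.node_left.delayed_timeout on the given indices.

    Hypothetical helper for illustration; base_url and timeout are
    supplied by the caller.
    """
    body = {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Assumed address and value; adjust both for your cluster.
    req = build_delayed_timeout_request("http://localhost:9200", "10m")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To apply the change against a real cluster, send the request:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

The longer delay only helps if the node actually comes back within the window. If it stays away longer, reallocation still starts once the timeout expires.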