Settings to make cluster stable

We occasionally see ES cluster instability caused by unoptimized queries. Our developers are working on optimizing them. When a poorly performing query is sent to ES, GC time spikes and CPU utilization climbs to 90%. This stops the data node from responding to the master's ping requests. As a result, a few data nodes leave the cluster and rejoin it later.

It takes 45 minutes to 1 hour for the cluster to return to 'GREEN', because replica shards are reallocated after a data node leaves the cluster.

Currently, 'discovery.zen.ping.timeout' is at its default (30 seconds), and 'index.unassigned.node_left.delayed_timeout' is also at its default (1 minute).

My question is: until we have optimized all the queries, which of the above settings should I tweak to keep the cluster stable, or at least get it back to 'GREEN' sooner?

ES version: 1.7
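For reference, 'index.unassigned.node_left.delayed_timeout' is an index-level setting that can be changed on a live index through the update index settings API, so it does not need a restart. Raising it gives a node that dropped out during a GC pause more time to rejoin before its replicas are reallocated elsewhere. Below is a minimal sketch of building that request in Python. The host `http://localhost:9200`, the `5m` value, and the helper name `build_delayed_timeout_request` are assumptions for illustration, not recommendations from this thread.

```python
import json
import urllib.request


def build_delayed_timeout_request(base_url, timeout="5m", index="_all"):
    """Build (but do not send) a PUT request that changes the delayed
    allocation timeout on the given index pattern.

    base_url, timeout and index are illustrative defaults, not tuned values.
    """
    body = {
        "settings": {
            "index.unassigned.node_left.delayed_timeout": timeout
        }
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
    )


if __name__ == "__main__":
    req = build_delayed_timeout_request("http://localhost:9200", "5m")
    # Inspect the request before sending it anywhere.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To apply it against a real cluster, uncomment the next line:
    # urllib.request.urlopen(req)
```

The sketch only constructs the request, so it can be reviewed first. Whether 5 minutes is long enough depends on how long the affected nodes typically stay out of the cluster.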

The first thing you can do is upgrade. There are a number of improvements around circuit breakers that will prevent these bad queries from even running.

The second and third things you can do are also upgrade :wink:

We are planning to upgrade ES to a newer version. We have already set the circuit breaker. These are the settings:

indices.fielddata.cache.size: 13.6GB
indices.fielddata.breaker.limit: 15.0GB

Are there any other circuit breaker settings besides the ones above?
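As far as I know, the 1.x line (from 1.4 on) also exposes the breakers under the 'indices.breaker.*' names: 'indices.breaker.fielddata.limit', 'indices.breaker.request.limit' and the parent 'indices.breaker.total.limit', all adjustable through the cluster update settings API. Here is a hedged sketch of building such a request in Python. The host, the percentage values and the helper name `build_breaker_settings_request` are assumptions for illustration, so please check the names and limits against the documentation for your exact version before applying anything.

```python
import json
import urllib.request


def build_breaker_settings_request(base_url, fielddata_limit="60%",
                                   request_limit="40%", persistent=False):
    """Build (but do not send) a cluster settings update for the
    fielddata and request circuit breakers.

    The limit values are placeholders, not tuned recommendations.
    """
    scope = "persistent" if persistent else "transient"
    body = {
        scope: {
            "indices.breaker.fielddata.limit": fielddata_limit,
            "indices.breaker.request.limit": request_limit,
        }
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
    )


if __name__ == "__main__":
    req = build_breaker_settings_request("http://localhost:9200")
    # Inspect the request before sending it anywhere.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To apply it against a real cluster, uncomment the next line:
    # urllib.request.urlopen(req)
```

A transient update is lost on a full cluster restart, which makes it a low-risk way to try a value. Pass `persistent=True` in the sketch to build the persistent form instead.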

Also, would it help to increase 'discovery.zen.ping.timeout' from 30 seconds to 1 or 2 minutes? Changing this setting requires an ES service restart, correct?
