Restart on a daily basis

Hello everyone,

Recently we ran into circuit_breaking_exception errors on multiple clusters, right after upgrading from 6.8 to 7.4. We found the problem could be worked around by tuning JVM options. However, because of other issues in our history with ES, one of our ES specialists recommended doing a rolling restart of all ES clusters every day, to proactively head off any possible memory problems in the future.

Does anyone do such restarts in production? What negative impact could we expect from implementing daily restarts?

A daily restart should not be necessary at all. This sounds like it would cause more problems than it solves, since it will force nodes to constantly transfer shard data between themselves as nodes leave and rejoin the cluster. That in turn can cause memory pressure of its own, eat valuable disk IO, and put extra load on the master node through more cluster state updates.
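If you do end up restarting nodes, at least disable shard allocation while each node is down so the cluster doesn't immediately start relocating shards. Here is a minimal sketch of that step, assuming a node reachable at http://localhost:9200 and the Python `requests` library (adjust host and auth for your environment):

```python
import requests

ES = "http://localhost:9200"

# 1. Stop allocation of replica shards so the cluster does not start
#    rebalancing the moment the node drops out.
requests.put(f"{ES}/_cluster/settings", json={
    "transient": {"cluster.routing.allocation.enable": "primaries"}
}).raise_for_status()

# ... restart the node here ...

# 2. Reset the setting (null restores the default "all") and wait for green.
requests.put(f"{ES}/_cluster/settings", json={
    "transient": {"cluster.routing.allocation.enable": None}
}).raise_for_status()
requests.get(f"{ES}/_cluster/health?wait_for_status=green&timeout=60s")
```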

All a restart will do is clear out some temporary heap garbage; otherwise the node will soon grow back to its "steady state" heap usage. E.g. if you were at 75% heap usage before a restart, you'll quickly get back to 75% after the restart, because that's what the node "needs" in your environment. Similarly, JVM tuning is rarely effective because it treats the symptom, not the cause.
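If you want to see that steady state for yourself, per-node heap usage is easy to pull from the nodes stats API before and after a restart. A small sketch, assuming http://localhost:9200 and the Python `requests` library:

```python
import requests

# Print current heap usage for every node in the cluster.
stats = requests.get("http://localhost:9200/_nodes/stats/jvm").json()
for node in stats["nodes"].values():
    print(node["name"], node["jvm"]["mem"]["heap_used_percent"], "%")
```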

Circuit breakers trip when you are attempting to do something that is "too large" for the cluster. So either your requests need to be optimized, or your cluster is at its capacity given the size of the data and the types of requests you are asking of it. In that case you just need to expand the cluster. There are several different types of circuit breakers, so it's hard for me to offer a solution, but rolling restarts on a daily basis are not a normal scenario :slight_smile:
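To figure out which breaker is actually tripping, the nodes stats API reports each breaker's limit, current estimate, and trip count. A rough sketch, again assuming http://localhost:9200 and the Python `requests` library:

```python
import requests

# List every circuit breaker on every node, with its configured limit,
# current estimated usage, and how many times it has tripped.
stats = requests.get("http://localhost:9200/_nodes/stats/breaker").json()
for node in stats["nodes"].values():
    for name, breaker in node["breakers"].items():
        print(node["name"], name,
              "limit:", breaker["limit_size"],
              "estimated:", breaker["estimated_size"],
              "tripped:", breaker["tripped"])
```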


I agree, daily restarts look abnormal, even if they can help with some unknown memory-leak situations. We hit an issue with the total breaker limit configured at 90%, where a simple request to _cat/recovery?active_only=true started to return a circuit_breaking_exception. The bad thing here is that a cluster restart let it run for a few days without problems. Maybe it was a coincidence, but now it is being used as one more argument for doing daily restarts everywhere.
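For context, a quick way to check how the breaker limit is actually configured and to reproduce the failing call, sketched assuming http://localhost:9200 and the Python `requests` library (the 90% value and host are specific to our setup):

```python
import requests

ES = "http://localhost:9200"

# Show the circuit breaker settings, including defaults, so the effective
# indices.breaker.total.limit value is visible.
settings = requests.get(
    f"{ES}/_cluster/settings?include_defaults=true&filter_path=**.breaker.**"
).json()
print(settings)

# The call that was returning circuit_breaking_exception.
resp = requests.get(f"{ES}/_cat/recovery?active_only=true")
print(resp.status_code, resp.text[:500])
```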

Do you have Monitoring enabled?

Yep, we monitor cluster state, heap usage, single replicas, and frozen queues, and we recently added monitoring of circuit_breaking_exception. During periods when this exception is active, we see a kind of plateau in heap usage around 80%, sometimes 90%, that normalizes after a restart. But sometimes it returns to the same plateau very quickly while recovering replicas, as polyfractal described.
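For completeness, a tiny polling sketch of the kind of per-node heap sampling that shows such a plateau, assuming http://localhost:9200, the Python `requests` library, and a 60-second interval:

```python
import time
import requests

# Sample heap.percent for every node once a minute; a sustained flat line
# near the breaker limit is the plateau described above.
while True:
    nodes = requests.get(
        "http://localhost:9200/_cat/nodes?format=json&h=name,heap.percent"
    ).json()
    for n in nodes:
        print(time.strftime("%H:%M:%S"), n["name"], n["heap.percent"])
    time.sleep(60)
```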

Is that external to Elasticsearch, or do you use our included monitoring?

We use Zabbix and its agents + X-Pack.
