Rolling Restart-Related Anxiety Disorder

aaron_ximm · October 21, 2016, 12:50am

Howdy hey.

Is there a doctor in the house?

Here at my organization we suffer from Rolling Restart-Related Anxiety Disorder, or RRRAD.

Symptoms include white knuckles, elevated heart rate/blood pressure, and cold sweat.

This disorder is seen when attempting rolling restart/upgrade of our production clusters.

Patient history:

I have adapted the Elastic-distributed Ansible role to support serialized rolling restart for our clusters.

I believe I have implemented all the current best practices. E.g.:

- suspension of indexing operations on all indices on the cluster
- index.unassigned.node_left.delayed_timeout set to several minutes
- routing allocation set to none before each node operation
- flush, then synced flush, repeated until all shards report success, before each node operation
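
For concreteness, the per-node preparation listed above could be sketched roughly as follows. This is only an illustration under assumptions, not the actual Ansible role: `ES_URL`, `pre_restart_requests`, `synced_flush_ok`, and `send` are hypothetical names, the 5m timeout is a placeholder, and the endpoints are the ones documented for Elasticsearch releases of this era (synced flush was removed in later versions). Suspending indexing happens on the client side and is not shown.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: replace with your cluster address


def pre_restart_requests(delayed_timeout="5m"):
    """Return (method, path, body) tuples for the steps run before touching a node."""
    return [
        # Delay replica reallocation when a node leaves the cluster.
        ("PUT", "/_all/_settings",
         {"settings": {"index.unassigned.node_left.delayed_timeout": delayed_timeout}}),
        # Stop the cluster from moving shards around during the restart.
        ("PUT", "/_cluster/settings",
         {"transient": {"cluster.routing.allocation.enable": "none"}}),
        # Plain flush first, then a synced flush to write sync ids.
        ("POST", "/_flush", None),
        ("POST", "/_flush/synced", None),
    ]


def synced_flush_ok(response):
    """True only if no shard copy reported a failed synced flush."""
    return all(
        entry.get("failed", 0) == 0
        for entry in response.values()
        if isinstance(entry, dict)
    )


def send(method, path, body=None, base_url=ES_URL):
    """Send one request and return the parsed JSON response.

    urlopen raises urllib.error.HTTPError on non-2xx responses,
    so the caller needs to handle that case (and retry the flush).
    """
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        base_url + path, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())
```

A driver would call `send(*step)` for each tuple in order and repeat the last two steps until `synced_flush_ok` returns True.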

We run with 2 replicas for production indices.

After each node is restarted, Ansible waits for it to re-join the cluster and for the shards it holds to re-validate. Once they do (in the typical case quite quickly, by design and by virtue of synced flush), the replica count is reached and the cluster goes green.

To force shard re-validation, which I have found needs hand-holding, I have written a helper script which also:

- caches the full shard allocation state for all data nodes before the rolling restart/upgrade starts
- reallocates to each node the shards it held (should have data for), before enabling routing allocation
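
The caching step might look something like the sketch below. It is not the actual helper script: `shards_by_node` and `missing_after_restart` are hypothetical names, and the input is assumed to be the parsed JSON rows from `GET /_cat/shards?format=json` (fields `index`, `shard`, `state`, `node`). Comparing a snapshot taken before the restart with one taken afterwards shows which shard copies did not come back on a node.

```python
from collections import defaultdict


def shards_by_node(cat_rows):
    """Group started shard copies by the node that holds them."""
    held = defaultdict(set)
    for row in cat_rows:
        # Skip unassigned, initializing, and relocating copies.
        if row.get("state") == "STARTED" and row.get("node"):
            # The cat API returns shard numbers as strings.
            held[row["node"]].add((row["index"], int(row["shard"])))
    return dict(held)


def missing_after_restart(before, after, node):
    """Shard copies the node held before the restart but does not hold now."""
    return sorted(before.get(node, set()) - after.get(node, set()))
```

Logging the output of `missing_after_restart` for each node would at least record exactly which shards fall into the failing 10%.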

After reallocation, Ansible waits for cluster state to return to green before proceeding to the next node.
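
The wait-for-green gate can be sketched as a simple polling loop. This is an illustration only, not the Ansible implementation: `wait_for_green` is a hypothetical name, `get_health` is assumed to be a caller-supplied function returning the parsed JSON of `GET /_cluster/health`, and the timeout and poll interval are placeholder values.

```python
import time


def wait_for_green(get_health, timeout_s=1800, poll_s=10,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll cluster health until it reports green, or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        health = get_health()
        if health.get("status") == "green":
            return health
        if clock() >= deadline:
            raise TimeoutError(
                "cluster still %s after %ss" % (health.get("status"), timeout_s)
            )
        sleep(poll_s)
```

Raising on timeout, rather than proceeding, keeps the restart from moving to the next node while replicas are still being rebuilt.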

In my experience, this process works smoothly about 90% of the time.

The other 10% of the time, for unknown reasons, re-"allocation" fails to recover one or more shards that had reported a successful synced flush...

In those cases, the cluster stays yellow while 'failed' (sic) shards are re-built, and the process suspends until replica reconstruction completes.

Happily, we have not lost data due to these 'shard restart failures.'

But they are cause for anxiety, and we don't understand how/why this could be happening at all after synchronized flush succeeds.

With that in mind I have been poking around forum history to see how 'we're doing it wrong.'

Is this the typical experience? Or are we indeed somehow doing it wrong?

I am guessing we are.

All ears with respect to things we might be overlooking in our config or process...

My dream btw is that there might someday be a Rolling Restart API... it seems like this is a challenging operation which is nonetheless quite commonly needed, and it's still kind of up to each org to implement in-house a means of doing this safely.

Is that a dream with any legs?

Aaron