Good point. The initial decision to keep it in a maintenance window was due to a yellow cluster supposedly degrading performance (this was read in documentation, but never actually observed).
As we get more (and bigger) clusters, our traditional maintenance window will likely be less and less suited.
I know of some clusters that can take a week for this sort of thing.
I would love to know more about a scenario like that, specifically how these are updated.
We use an Ansible playbook that, after each node upgrade, polls the cluster for a green (or acceptable yellow, i.e. no relocating/initializing shards) status for a fixed number of minutes.
I can perfectly well let it run indefinitely until the cluster has finished its recovery.
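To make the "acceptable yellow" condition concrete, here is a minimal Python sketch of the kind of check our playbook does, polling `/_cluster/health` until the cluster settles. The base URL, poll interval, and the idea of treating a quiet yellow as done are our own conventions, not anything prescribed by Elasticsearch:

```python
import json
import time
from urllib.request import urlopen  # assumes direct HTTP access to the cluster


def cluster_settled(health: dict) -> bool:
    """True when the cluster is green, or yellow with no shard movement
    (no relocating or initializing shards)."""
    if health["status"] == "green":
        return True
    return (health["status"] == "yellow"
            and health["relocating_shards"] == 0
            and health["initializing_shards"] == 0)


def wait_for_settled(base_url: str, poll_seconds: int = 30) -> None:
    """Poll /_cluster/health until cluster_settled() is true.
    No timeout here, i.e. the 'run indefinitely' variant."""
    while True:
        with urlopen(f"{base_url}/_cluster/health") as resp:
            health = json.load(resp)
        if cluster_settled(health):
            return
        time.sleep(poll_seconds)


# Example health payloads, trimmed to the relevant fields:
busy = {"status": "yellow", "relocating_shards": 2, "initializing_shards": 1}
ok = {"status": "yellow", "relocating_shards": 0, "initializing_shards": 0}
print(cluster_settled(busy), cluster_settled(ok))  # False True
```

In the playbook this is just a `uri` task in a loop with `until`/`retries`, but the decision logic is the same.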
I do struggle to think of edge cases, apart from cluster status, that I would need to evaluate if I were to do such long-running, unattended updates.
I would imagine, for example, that I need to parse the result of `/_cluster/allocation/explain` from time to time, to make sure I'm not stuck on things like disk space constraints. This is something we experience a lot with one of our clusters, which has 1 TB shards.
Although, in that scenario, initializing and relocating shards should also be 0; if they're not, it might just mean the cluster is still rebalancing to make room for the unallocated shard and will eventually succeed.
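As a sketch of what that explain check could look like: the heuristic below flags the "stuck on disk" case, where allocation is refused on every node and each refusal cites the `disk_threshold` decider. The field names match the real `/_cluster/allocation/explain` response shape, but the sample values and the heuristic itself are made up for illustration:

```python
# Trimmed example of an /_cluster/allocation/explain response for an
# unassigned shard (field names follow the API; the values are invented).
blocked = {
    "can_allocate": "no",
    "node_allocation_decisions": [
        {"node_name": "node-1",
         "deciders": [{"decider": "disk_threshold", "decision": "NO",
                       "explanation": "allocating would exceed the low watermark"}]},
        {"node_name": "node-2",
         "deciders": [{"decider": "disk_threshold", "decision": "NO",
                       "explanation": "not enough space for the shard"}]},
    ],
}


def stuck_on_disk(explain: dict) -> bool:
    """Heuristic: allocation is refused on every node, and every refusal
    includes a disk_threshold NO decision. In that case waiting won't help,
    so the unattended run should stop and alert instead of looping forever."""
    if explain.get("can_allocate") != "no":
        return False
    decisions = explain.get("node_allocation_decisions", [])
    return bool(decisions) and all(
        any(d["decider"] == "disk_threshold" and d["decision"] == "NO"
            for d in node.get("deciders", []))
        for node in decisions
    )


print(stuck_on_disk(blocked))  # True
```

Combined with the previous point, the rule would be: only run this check when relocating and initializing shards are both 0, since otherwise the cluster may still be making room on its own.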
The previous paragraph is more me thinking out loud.
I would be interested to know, for the one-week cluster recovery example, what kind of checks are built around it to ensure the rolling upgrade is not 'stuck'.