Good point. The initial decision to keep it in a maintenance window was due to a yellow cluster supposedly degrading performance (this was read in documentation, but never actually observed).
As we get more (and bigger) clusters, our traditional maintenance window will likely be less and less suited.
I know of some clusters that can take a week for this sort of thing.
I would love to know more about a scenario like that, specifically how these are updated.
We use an Ansible playbook that, after each node upgrade, polls the cluster for a green (or acceptable yellow, i.e. no relocating/initializing shards) status for a fixed number of minutes.
I can perfectly well let it run indefinitely until the cluster has finished its recovery.
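To make the "acceptable yellow" condition concrete, here is a minimal Python sketch of the kind of check our playbook does, polling `/_cluster/health` until the cluster settles. The base URL, poll interval, and the idea of treating a quiet yellow as done are our own conventions, not anything prescribed by Elasticsearch:

```python
import json
import time
from urllib.request import urlopen  # assumes direct HTTP access to the cluster


def cluster_settled(health: dict) -> bool:
    """True when the cluster is green, or yellow with no shard movement
    (no relocating or initializing shards)."""
    if health["status"] == "green":
        return True
    return (health["status"] == "yellow"
            and health["relocating_shards"] == 0
            and health["initializing_shards"] == 0)


def wait_for_settled(base_url: str, poll_seconds: int = 30) -> None:
    """Poll /_cluster/health until cluster_settled() is true.
    No timeout here, i.e. the 'run indefinitely' variant."""
    while True:
        with urlopen(f"{base_url}/_cluster/health") as resp:
            health = json.load(resp)
        if cluster_settled(health):
            return
        time.sleep(poll_seconds)


# Example health payloads, trimmed to the relevant fields:
busy = {"status": "yellow", "relocating_shards": 2, "initializing_shards": 1}
ok = {"status": "yellow", "relocating_shards": 0, "initializing_shards": 0}
print(cluster_settled(busy), cluster_settled(ok))  # False True
```

In the playbook this is just a `uri` task in a loop with `until`/`retries`, but the decision logic is the same.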
I do struggle to think of edge cases, apart from cluster status, that I would need to evaluate if I were to do such long-running, unattended updates.
I would imagine, for example, that I need to parse the result of `/_cluster/allocation/explain` from time to time, to make sure I'm not stuck on things like disk space constraints. This is something we experience a lot with one of our clusters, which has 1 TB shards.
Although, in that scenario, initializing and relocating shards should also be 0; if they're not, it might just mean the cluster is still rebalancing to make room for the unallocated shard and will eventually succeed.
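As a sketch of what that explain check could look like: the heuristic below flags the "stuck on disk" case, where allocation is refused on every node and each refusal cites the `disk_threshold` decider. The field names match the real `/_cluster/allocation/explain` response shape, but the sample values and the heuristic itself are made up for illustration:

```python
# Trimmed example of an /_cluster/allocation/explain response for an
# unassigned shard (field names follow the API; the values are invented).
blocked = {
    "can_allocate": "no",
    "node_allocation_decisions": [
        {"node_name": "node-1",
         "deciders": [{"decider": "disk_threshold", "decision": "NO",
                       "explanation": "allocating would exceed the low watermark"}]},
        {"node_name": "node-2",
         "deciders": [{"decider": "disk_threshold", "decision": "NO",
                       "explanation": "not enough space for the shard"}]},
    ],
}


def stuck_on_disk(explain: dict) -> bool:
    """Heuristic: allocation is refused on every node, and every refusal
    includes a disk_threshold NO decision. In that case waiting won't help,
    so the unattended run should stop and alert instead of looping forever."""
    if explain.get("can_allocate") != "no":
        return False
    decisions = explain.get("node_allocation_decisions", [])
    return bool(decisions) and all(
        any(d["decider"] == "disk_threshold" and d["decision"] == "NO"
            for d in node.get("deciders", []))
        for node in decisions
    )


print(stuck_on_disk(blocked))  # True
```

Combined with the previous point, the rule would be: only run this check when relocating and initializing shards are both 0, since otherwise the cluster may still be making room on its own.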
The previous paragraph is more me thinking out loud.
I would be interested to know, for the one-week cluster recovery example, what kind of checks are built around it to ensure the rolling upgrade is not 'stuck'.