Slow cluster recovery after node system updates

I’ve been doing lifecycle management for ES clusters recently, from clusters with indices without replicas to clusters with 1TB shards.
While the problems with those examples are obvious, I recently got more 'normal' clusters to manage (the work is upgrading system packages and rebooting; ES itself is pinned at version 7.16.3).
Twice in a row now during system updates, these have taken more time to recover than my maintenance window allows for (2 hours).

The Elastic docs provide a clear path to follow; I made and use an Ansible playbook that updates the nodes based on those directives (excluding the tiered storage and ML steps, as these are not actively used in our managed clusters).
The playbook only continues to the next node if either the cluster is green, or the cluster is yellow with no actively initializing/relocating shards.
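
Concretely, that gate boils down to polling the cluster health API; a minimal sketch of the idea (localhost:9200 without auth and the jq filtering are just assumptions for illustration, not literally what the playbook runs):

# Green is fine; yellow is only accepted when nothing is initializing or relocating.
curl -s 'http://localhost:9200/_cluster/health?timeout=30s' |
  jq '{status, initializing_shards, relocating_shards, unassigned_shards}'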

My question is: when the node is back and allocation is set back to all, what exactly is the 'recovery' and how long is it supposed to take? For the latter, I know that 'it depends' is the only acceptable answer.
My understanding of what happens during recovery:

  • The node is brought back and tells the cluster what shards it has.
  • The indices (primary shards) created during the upgrade will spread their replicas around.

Is there a recovery aspect that I am missing? Because this makes it seem to me that recovery should be nowhere near 2 hours.
It must be something with the cluster itself; it consists of 2 data nodes and 1 witness.
Apart from the older version (7.16.3):

  • No new indices were made during the time replica shard allocation was disabled
  • The 5 largest shards are 40GB; the remaining 172 shards are no larger than 5GB.
  • Only 2 shards have 170 million documents; the rest have no more than 50 million.
  • There is ample space left on both nodes (>50%)
  • A maximum of 400 Mbit/s was transferred between nodes, not saturating the 1 Gbit link.
  • node_concurrent_recoveries and node_concurrent_outgoing_recoveries are set to 10, with indices.recovery.max_bytes_per_sec left at its default of 40MB (see the sketch after this list for how these can be checked).
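
For reference, these values can be checked with something like this (a sketch; a cluster reachable on localhost:9200 without auth is an assumption):

# Dump the current settings plus the built-in defaults as flat keys and
# pick out the recovery-related ones.
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' |
  grep -E 'node_concurrent|recovery.max_bytes_per_sec'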

Hello and welcome to the forum.

The recovery step in the rolling upgrade process is when the cluster will initialize primary shards, allocate replica shards, and also relocate shards to balance the cluster.

As you mentioned, the time this will take depends on a lot of things, like the number of shards, the number of replicas, the size of the shards, the specs of the nodes, etc.

It is not clear what you mean by '1 witness', is this a typo? This is not an Elasticsearch concept. Do you mean a tie-breaker, i.e. 2 data+master nodes and 1 tie-breaker node so you have a quorum?

One issue in your configuration, especially in this old version, is this:

node_concurrent_recoveries and node_concurrent_outgoing_recoveries are set to 10

The default value for those settings is 2, and the documentation for newer versions like 8.X and 9.X recommends not changing them.

Basically this:

Increasing this setting may cause shard movements to have a performance impact on other activity in your cluster, but may not make shard movements complete noticeably sooner. We do not recommend adjusting this setting from its default of 2

I suspect that changing this from 2 to 10 may be causing some unnecessary shard movement (the relocating shards, relo), which then impacts your cluster recovery time.
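
If you want to go back to the defaults, setting those keys to null should do it; a minimal sketch, assuming they were applied as persistent settings and the cluster answers on localhost:9200 without auth:

# Remove the overrides so the built-in default of 2 applies again.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": null,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": null
  }
}'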


Thank you very much for the response.

It is not clear what you mean by '1 witness', is this a typo? This is not an Elasticsearch concept. Do you mean a tie-breaker, i.e. 2 data+master nodes and 1 tie-breaker node so you have a quorum?

My bad, it is as you say: I meant a tie-breaker/voting-only node.

I suspect that changing this from 2 to 10 may be causing some unnecessary shard movement (the relocating shards, relo), which then impacts your cluster recovery time.

I've now read the warning in the docs, but I'd like to understand the underlying reason a bit better.
Would this mean that, in my case, setting node_concurrent_recoveries to 10 is not actually better, because the extra resources that 10 concurrent recoveries compete for among themselves will in turn degrade performance/recovery time?
I think the answer will depend on the observed CPU load, iowait, disk performance and network. Sadly, the network metric is the only one I have from that time.

I will remove the custom configuration and monitor the results more closely in my next maintenance window.
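
For that closer monitoring, the cat recovery API is what I have in mind, something like (a sketch; host and auth are placeholders):

# Show only the recoveries that are still running, with per-shard progress.
curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,type,stage,source_node,target_node,bytes_percent,translog_ops_percent'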

What @leandrojmp says is good, but one more point:

“Maintenance window” typically means you’ve told users that there’s going to be an outage and to expect disruption, but if you’re doing a rolling restart you don’t need to do that. As long as the cluster remains in yellow health it will be available for indexing and searches as normal.

2 hours is pretty short, depending on your overall cluster size etc. of course. I know of some clusters that can take a week for this sort of thing.


Good point. The initial decision to keep it in a maintenance window was due to a yellow cluster supposedly degrading performance (something we read in documentation, but never actually observed).
As we get more (and bigger) clusters, our traditional maintenance window will likely be less and less suited.

I know of some clusters that can take a week for this sort of thing.

I would love to know more about a scenario like that, specifically how those clusters are updated.
We use an Ansible playbook that, after each node upgrade, polls the cluster for a green or acceptable yellow (no relo/init shards) status for a fixed number of minutes.

I can perfectly well let it run indefinitely until the cluster has finished its recovery.
I do struggle a bit to think of the edge cases I would need to evaluate, apart from cluster status, if I were to do such long-running, unattended updates.
I would imagine, for example, that I need to parse the result of a /_cluster/allocation/explain from time to time, as a way to make sure I'm not stuck on things like space constraints. This is something we experience a lot with one of our clusters, which has 1TB shards.
Although, in that scenario, the initializing and relocating shards should also be 0. If not, it might mean the cluster is still rebalancing to make room for the unallocated shard and will eventually succeed.
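
Concretely, the kind of check I have in mind is something like this (a sketch; host and auth are placeholders):

# With an empty body this explains the first unassigned shard it finds
# (and returns an error when there are none); the interesting parts are
# can_allocate / allocate_explanation and the per-node deciders.
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'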

The above is more me thinking out loud.
I would be interested to know, for the 1-week cluster recovery example, what kind of checks are built around it to ensure the rolling upgrade is not 'stuck'.

1TB shards are not recommended.

A massive shard is like a massive sofa: it might fit in the destination room, and look impressive once there, but you need to move everything else out of the way, and getting it through the door is a PITA. More smaller sofas are almost always better.
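
If you want to see where the big sofas are, sorting the cat shards API by store size shows them quickly; a sketch (host and auth are placeholders):

# Largest shards first; these are the ones that are painful to move around.
curl -s 'http://localhost:9200/_cat/shards?v&s=store:desc&h=index,shard,prirep,store,node' | head -20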

btw, I like Ansible, it’s a great tool. But be careful not to fall into the “whatever the problem is, the only tool I will use is my hammer” trap.


That’s pretty much it - follow the rolling restart process in the manual, and note that this process does not include “give up if it took longer than 2h” or similar time limits :grin:

It’s kinda complicated, these things are orchestrated by something like ECE or ECK which knows an awful lot more about the internals of ES than you could achieve with Ansible.


It’s kinda complicated, these things are orchestrated by something like ECE or ECK which knows an awful lot more about the internals of ES than you could achieve with Ansible.

Completely understandable; I may eventually recommend services like that when the clusters get very complex/demanding.

On a side note, per @leandrojmp's comment I changed node_concurrent_recoveries back to 2 (and set the incoming/outgoing ones to null). This seems to work; it now takes 1 hour.

I was planning on testing an increase of indices.recovery.max_bytes_per_sec from 40MB to 90MB.
But the subsequent testing (stopping one Elasticsearch node and rebooting) seems to have somehow sped up after doing it a second time. Now, in no time at all, the cluster is already green and barely any data was transferred.
So I think I misunderstand what happens during recovery. I imagine that when a node comes back, it tells the other node(s) what shards it has and fetches whatever it might be missing from them. And since it already recovered earlier, it already has all the shards it needs to turn green almost instantly?
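
For reference, the bump I had planned was simply this (a sketch; persistent setting, host and auth are placeholders):

# Raise the per-node recovery throttle from the 40MB default to 90MB.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "90mb"
  }
}'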

It Depends™ - if you set cluster.routing.allocation.enable: primaries as per the manual then there normally won’t be many (any?) shard movements while the node is down so recoveries will be quick, just copying any missing operations into otherwise fully-in-sync shard copies.

If you don’t set this setting then removing the node will likely trigger some rebalancing operations and then adding the node back in again will trigger more rebalancing, so it can take a while to settle.

Yup! Apart from the machine learning step and storage tiers, I fully follow the manual.
The hope was that cluster.routing.allocation.enable: primaries would prevent any shard movement and that, upon setting allocation back to all, there would be next to no movement/recovery required.

For the latter, I am a bit confused that recovery could take so long if the returning node already has (most, if not all of) the shards.
I can understand that some missing operations would have to be copied over, but would that amount to entire shards (which the returning node already had pre-update/reboot) having to be copied over in full?

At this point, I tried to find what else could be wrong with the cluster configuration in terms of non-default settings.

{
  "persistent" : {
    "cluster.routing.allocation.disk.watermark.low" : "88%",
    "cluster.routing.allocation.node_concurrent_recoveries" : "2"
  },
  "transient" : {
    "cluster.routing.allocation.enable" : "all"
  }
}

The transient setting was the only one I found odd; it turns out that I set allocation from primaries back to all as a transient setting instead of a persistent setting as specified in the docs :sweat_smile:.

Looking at the docs, they warn that using transient settings at all is not recommended (anymore).
I will change that, but I can't imagine it causing weird behavior: the setting would be set to primaries before the node updates/reboots, and even if the transient setting disappeared when ES starts, allocation would just go back to all and recovery would start.
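
So the corrected steps will look something like this (a sketch; persistent instead of transient, host and auth are placeholders):

# Before taking a node down: restrict allocation to primaries only.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}'

# Once the node has rejoined: remove the override (null restores the
# default, which is "all") so replicas can be allocated again.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}'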