How robust is an Elastic cluster for planned single host outages?

We are doing some network changes on a dozen or so Elastic hosts, necessitating taking each host off the network briefly (up to a few minutes).

We have at least 2 replicas for every index (i.e. 1 primary + 2 replicas).

We tried this with one host, taking the same approach as when we patch our hosts one at a time:

  1. Set the cluster's cluster.routing.allocation.enable to "primaries"
  2. Stop Elastic services
  3. Restart host
  4. Start Elastic services
  5. Restore cluster.routing.allocation.enable to "all"
  6. Wait for the cluster to return to green health
  7. Go to next host
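
Concretely, steps 1, 5 and 6 boil down to calls along these lines (just a sketch, in Kibana Dev Tools syntax; adjust for however you drive the cluster):

# Step 1: restrict allocation to primaries while the node is down
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# Step 5: restore normal allocation (null removes the override, i.e. back to the default of "all")
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

# Step 6: wait for the cluster to report green again
GET _cluster/health?wait_for_status=green&timeout=60s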

For the network outage work we didn't stop the Elastic services; the network outage was very short, and the cluster stayed green all along.

  1. For planned single-host outages:
    a. Is the above list of steps correct?
    b. Is there anything else we should also do, or do instead?

  2. For the planned single-host network outage described above:
    a. Is the above list of steps overkill?
    b. Is there anything else we should also do, or do instead?
    c. Would we be able to do no setup at all, and just take the host off the air, as would happen with an unplanned single-host outage?

The manual includes detailed instructions on performing a rolling restart. It's roughly what you describe, but the process in the manual is the recommended one.
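
(If I remember the docs right, the documented procedure adds a couple of optional steps on top of your list, e.g. stopping non-essential indexing and flushing before each node goes down, roughly

POST _flush

so that there is less translog to replay when the node's shards recover.)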

Elasticsearch is resilient to unplanned single-host outages as long as you have designed your cluster correctly, so yes, there isn't technically a need to do anything special. The only difference is in how long it takes for the cluster to fully recover and stabilise after each step. If Elasticsearch is not prepared for the outage then it may react (e.g. by moving shards around) because it doesn't know the node is coming back soon. By disabling allocation and so on you can avoid these unwanted reactions.

Thank you very much @DavidTurner for your reply. That's the method I'm using, so that's good.

However, when I did this just now, the cluster went yellow during the outage, and now that I've restored the setting ("cluster.routing.allocation.enable": null) a number of shards are being relocated to the host that had the outage.

The shards are for some large indexes, and I can see the indexes have the setting "index.unassigned.node_left.delayed_timeout": "1m".

There were about 120 unassigned shards, and now, four hours later, it's still processing the shard movements, with 10 shards yet to go.

What caused these shards to need reassignment?

I thought just setting "cluster.routing.allocation.enable": "primaries" was enough, but does this mean that the shards being relocated are primaries, and thus not restricted by that setting?

Does this mean that I still need to also set "index.unassigned.node_left.delayed_timeout" to be a larger value to cater for outages?
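
If so, I guess it would be something like this around each outage window (just a sketch; the 15m value is only an example):

# raise the delayed-allocation timeout on all indices before the outage
PUT _all/_settings
{
  "index.unassigned.node_left.delayed_timeout": "15m"
}

# drop it back afterwards (null restores the default of 1m)
PUT _all/_settings
{
  "index.unassigned.node_left.delayed_timeout": null
}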

Hm interesting. I wonder if we should recommend setting cluster.routing.rebalance.enable: none too, in case the cluster wasn't properly balanced before the outage.

Have you set any of the following settings? If so, what values are you using?

  • cluster.routing.allocation.node_initial_primaries_recoveries
  • cluster.routing.allocation.node_concurrent_recoveries
  • cluster.routing.allocation.node_concurrent_incoming_recoveries
  • cluster.routing.allocation.node_concurrent_outgoing_recoveries
  • cluster.routing.allocation.allow_rebalance
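
You can check the effective values, including the defaults, with something like:

GET _cluster/settings?include_defaults=true&flat_settings=true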

Thanks again. I don't think these have been set explicitly; the current values are:

cluster.routing.allocation.node_initial_primaries_recoveries = 4
cluster.routing.allocation.node_concurrent_recoveries = 2
cluster.routing.allocation.node_concurrent_incoming_recoveries = 2
cluster.routing.allocation.node_concurrent_outgoing_recoveries = 2
cluster.routing.allocation.allow_rebalance = indices_all_active

OK, I think that means some (primary) shards relocated during the outage because of deliberate allocation changes (e.g. adjusting allocation filters or migrating them between tiers), which will have disturbed the balance of the cluster.
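
If it happens again, the allocation explain API can show why a particular shard is unassigned, or whether the cluster wants to move an assigned shard and why, e.g. (my-index is a placeholder for one of the affected indices):

GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}

Called with no body it picks an arbitrary unassigned shard to explain.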

Thanks David. I'm not sure what triggered the relocation.

We performed more outages today; this time I set:

"cluster.routing.allocation.enable": "primaries"
"index.unassigned.node_left.delayed_timeout": "15m"

and the first few hosts were fine; the cluster stayed green throughout. Then one host had some issues, the outage took longer than 15m, and at 20m the cluster went yellow with the same number of shards as yesterday in delayed_unassigned_shards.

The problem was worked around and I reverted to

"cluster.routing.allocation.enable": "all"

and it's now moving shards back to that host; it looks like another few hours to wait.

So for my understanding, is it correct that:

  • Changing the delayed timeout to 15m helped increase the breathing space we had for the outage
  • Something else happened after 15m to cause the shard reallocation
    • I left our load jobs going so that data could still flow into the cluster, expecting that it would be written and available to consumers as normal. Is that correct?
    • I presume the writing of data to the cluster is the sort of event that could cause the shard reallocation?

Should I try further settings for these planned outages, e.g. the

cluster.routing.rebalance.enable: none 

you mentioned above? Are there risks with this?

Yes, I think so.

As for cluster.routing.rebalance.enable: none, I don't expect it to do anything, since cluster.routing.allocation.allow_rebalance: indices_all_active will also disable rebalancing while the cluster is missing a node.

Thank you very much David, you've been a great help!
