How robust is an Elastic cluster for planned single host outages?

We are doing some network changes on a dozen or so Elastic hosts, necessitating taking each host off the network briefly (up to a few minutes).

We have at least 2 replicas for every index (i.e. 1 primary + 2 replicas).

We tried this with one host, taking the same approach as when we patch our hosts one at a time:

  1. Set the cluster's cluster.routing.allocation.enable to "primaries"
  2. Stop Elastic services
  3. Restart host
  4. Start Elastic services
  5. Restore cluster.routing.allocation.enable to "all"
  6. Wait for the cluster to return to green health
  7. Go to next host
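
Concretely, steps 1, 5 and 6 boil down to calls along these lines (just a sketch, in Kibana Dev Tools syntax; adjust for however you drive the cluster):

# Step 1: restrict allocation to primaries while the node is down
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# Step 5: restore normal allocation (null removes the override, i.e. back to the default of "all")
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

# Step 6: wait for the cluster to report green again
GET _cluster/health?wait_for_status=green&timeout=60s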

For the network outage work we didn't stop the Elastic services; the network outage was very short, and the cluster stayed green all along.

  1. For planned single-host outages:
    a. Is the above list of steps correct?
    b. Is there anything else we should also do, or do instead?

  2. For the planned single-host network outage described above:
    a. Is the above list of steps overkill?
    b. Is there anything else we should also do, or do instead?
    c. Would we be able to do no setup at all, and just take the host off the air, as would happen with an unplanned single-host outage?

The manual includes detailed instructions on performing a rolling restart. It's roughly what you describe, but the process in the manual is the recommended one.
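
(If I remember the docs right, the documented procedure adds a couple of optional steps on top of your list, e.g. stopping non-essential indexing and flushing before each node goes down, roughly

POST _flush

so that there is less translog to replay when the node's shards recover.)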

Elasticsearch is resilient to unplanned single-host outages as long as you have designed your cluster correctly, so yes, there isn't technically a need to do anything special. The only difference is in how long it takes for the cluster to fully recover and stabilise after each step. If Elasticsearch is not prepared for the outage then it may react (e.g. by moving shards around) because it doesn't know the node is coming back soon. By disabling allocation and so on you can avoid these unwanted reactions.

Thank you very much @DavidTurner for your reply. That's the method I'm using, so that's good.

However, when I did this just now, the cluster went yellow during the outage, and now that I've restored the setting ("cluster.routing.allocation.enable": null) a number of shards are being relocated to the host that had the outage.

The shards are for some large indexes, and I can see the indexes have the setting "index.unassigned.node_left.delayed_timeout": "1m".

There were about 120 unassigned shards, and now, four hours later, it's still processing the shard movements, with 10 shards yet to go.

What caused these shards to need reassignment?

I thought just setting "cluster.routing.allocation.enable": "primaries" was enough, but does this mean that the shards being relocated are primaries, and thus not restricted by that setting?

Does this mean that I still need to also set "index.unassigned.node_left.delayed_timeout" to be a larger value to cater for outages?
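
If so, I guess it would be something like this around each outage window (just a sketch; the 15m value is only an example):

# raise the delayed-allocation timeout on all indices before the outage
PUT _all/_settings
{
  "index.unassigned.node_left.delayed_timeout": "15m"
}

# drop it back afterwards (null restores the default of 1m)
PUT _all/_settings
{
  "index.unassigned.node_left.delayed_timeout": null
}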

Hm interesting. I wonder if we should recommend setting cluster.routing.rebalance.enable: none too, in case the cluster wasn't properly balanced before the outage.

Have you set any of the following settings? If so, what values are you using?

  • cluster.routing.allocation.node_initial_primaries_recoveries
  • cluster.routing.allocation.node_concurrent_recoveries
  • cluster.routing.allocation.node_concurrent_incoming_recoveries
  • cluster.routing.allocation.node_concurrent_outgoing_recoveries
  • cluster.routing.allocation.allow_rebalance
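
You can check the effective values, including the defaults, with something like:

GET _cluster/settings?include_defaults=true&flat_settings=true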

Thanks again. I don't think these have been set explicitly; the current values are:

cluster.routing.allocation.node_initial_primaries_recoveries = 4
cluster.routing.allocation.node_concurrent_recoveries = 2
cluster.routing.allocation.node_concurrent_incoming_recoveries = 2
cluster.routing.allocation.node_concurrent_outgoing_recoveries = 2
cluster.routing.allocation.allow_rebalance = indices_all_active

OK, I think that means some (primary) shards relocated during the outage because of deliberate allocation changes (e.g. adjusting allocation filters or migrating them between tiers), which will have disturbed the balance of the cluster.
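
If it happens again, the allocation explain API can show why a particular shard is unassigned, or whether the cluster wants to move an assigned shard and why, e.g. (my-index is a placeholder for one of the affected indices):

GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}

Called with no body it picks an arbitrary unassigned shard to explain.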

Thanks David. I'm not sure what triggered the relocation.

We performed more outages today; this time I set:

"cluster.routing.allocation.enable": "primaries"
"index.unassigned.node_left.delayed_timeout": "15m"

and the first few hosts were fine; the cluster stayed green throughout. Then one host had some issues, the outage took longer than 15m, and at 20m the cluster went yellow with the same number of shards as yesterday in delayed_unassigned_shards.

The problem was worked around and I reverted to

"cluster.routing.allocation.enable": "all"

and it's now moving shards back to that host; it looks like another few hours to wait.

So for my understanding, is it correct that:

  • Changing the delayed timeout to 15m helped increase the breathing space we had for the outage
  • Something else happened after 15m to cause the shard reallocation
    • I left our load jobs going so that data could still flow into the cluster, expecting that it would be written and available to consumers as normal. Is that correct?
    • I presume the writing of data to the cluster is the sort of event that could cause the shard reallocation?

Should I try further settings for these planned outages, e.g. the

cluster.routing.rebalance.enable: none 

you mentioned above? Are there risks with this?

Yes, I think so.

As for cluster.routing.rebalance.enable: none, I don't expect it to do anything, since cluster.routing.allocation.allow_rebalance: indices_all_active will also disable rebalancing while the cluster is missing a node.

Thank you very much David, you've been a great help!
