How robust is an Elastic cluster for planned single host outages?

We are doing some network changes on about a dozen Elasticsearch hosts, which requires taking each host off the network briefly (for up to a few minutes).

We have at least 2 replicas for every index (i.e. 1 primary + 2 replica shards).

We tried this with one host, taking the same approach we use when patching our hosts one at a time:

  1. Set the cluster's `cluster.routing.allocation.enable` setting to `"primaries"`
  2. Stop Elastic services
  3. Restart host
  4. Start Elastic services
  5. Restore `cluster.routing.allocation.enable` to `"all"`
  6. Wait for the cluster to return to green health
  7. Go to next host

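Steps 1, 5 and 6 can be sketched as cluster-settings and health API calls. This is only a sketch, assuming an unauthenticated node reachable at `localhost:9200`; adjust the URL, TLS and credentials for your cluster:

```shell
ES=http://localhost:9200

# Step 1: allow allocation of primaries only, so replicas on the
# stopped node are not rebuilt elsewhere while it is down.
curl -X PUT "$ES/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'

# ... stop services, restart the host, start services ...

# Step 5: restore allocation of all shards.
curl -X PUT "$ES/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{ "persistent": { "cluster.routing.allocation.enable": "all" } }'

# Step 6: block until the cluster is green again (or the timeout expires).
curl "$ES/_cluster/health?wait_for_status=green&timeout=120s"
```

Setting the value to `null` instead of `"all"` would also work, since it resets the setting to its default of `"all"`.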
For the network change itself we didn't stop the Elasticsearch services; the outage was very short, and the cluster stayed green throughout.

  1. For single host planned outages:
    a. Is the above list of steps correct?
    b. Is there anything else we should also do, or do instead?

  2. For the above described planned single host network outage:
    a. Is the above list of steps overkill?
    b. Is there anything else we should also do, or do instead?
    c. Would we be able to do no setup at all, and just take the host off the air like would happen with an unplanned single host outage?

The manual includes detailed instructions for performing a rolling restart. It's roughly the process you describe, but the version in the manual is the recommended one.

Elasticsearch is resilient to unplanned single-host outages as long as you have designed your cluster correctly, so yes, there isn't technically a need to do anything special. The only difference is in how long the cluster takes to fully recover and stabilise after each step. If Elasticsearch is not prepared for the outage, it may react (e.g. by moving shards around) because it doesn't know the node is coming back soon. By disabling allocation and so on, you can avoid these unwanted reactions.
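One of those reactions is governed by delayed allocation: when a node leaves the cluster, Elasticsearch waits `index.unassigned.node_left.delayed_timeout` (one minute by default) before reallocating its shards to other nodes. If a planned outage might exceed that, raising the timeout is another way to stop shards being moved. A sketch, again assuming an unauthenticated node at `localhost:9200`:

```shell
ES=http://localhost:9200

# Before the outage: give a briefly absent node's shards a 5-minute
# grace period before they are rebuilt elsewhere.
curl -X PUT "$ES/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "settings": { "index.unassigned.node_left.delayed_timeout": "5m" } }'

# After the node is back and the cluster is green: reset to the default (1m).
curl -X PUT "$ES/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "settings": { "index.unassigned.node_left.delayed_timeout": null } }'
```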
