Had a unique situation occur recently, and I've been thinking about how to prevent it from recurring.
I have an Elasticsearch cluster which happens to run on multiple docker containers, on rackmount servers. Essentially, multiple nodes run on each piece of hardware. We use "rack awareness" to let Elastic know not to put replica shards on a node on the same piece of physical hardware. This setup works well.
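For context, here is a minimal sketch of the kind of cluster setting that allocation awareness relies on. The attribute name `rack_id` is a hypothetical placeholder, not necessarily what our cluster uses; each node would also need a matching `node.attr.rack_id` value in its own configuration. The sketch only builds the body for `PUT _cluster/settings` and does not contact a cluster.

```python
import json

# Hypothetical attribute name; the real cluster may use a different one.
AWARENESS_ATTRIBUTE = "rack_id"


def awareness_settings(attribute: str) -> dict:
    """Build the body for PUT _cluster/settings that enables
    shard allocation awareness on the given node attribute."""
    return {
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": attribute
        }
    }


# Print the JSON body that would be sent to the cluster.
print(json.dumps(awareness_settings(AWARENESS_ATTRIBUTE), indent=2))
```

With this in place, Elasticsearch tries to keep a primary and its replicas on nodes with different values of that attribute, which is why the hardware failure did not lose data.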
Last week, we had a hardware server fail (bad RAID card). This resulted in the loss of 4 Elastic nodes at essentially the same time. Because of our rack awareness settings, this did not result in the loss of any data.
However, what happened next was that the cluster started allocating new replicas for all the shards that no longer had redundancy. In doing so, Elastic essentially filled up the disks on the remaining nodes. Once the disks were full it would no longer ingest data, which caused an actual service outage.
While it would be easy here to just say "have more capacity", that's not really the point of my question. When one node fails, the existing recovery procedures work just fine. But if we lose multiple nodes at the same time (such as when a physical server fails, which is "an entire rack" from the ES perspective), I would obviously like to keep the service up, but I'd like new replicas to be delayed until a human can evaluate the situation.
The general setting I'm thinking of is the delayed allocation timeout (`index.unassigned.node_left.delayed_timeout`). It's set to 5min in our cluster today. So, the question really is this:
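For reference, this is roughly how the delayed allocation timeout (`index.unassigned.node_left.delayed_timeout`) gets applied to all indices. It is a minimal sketch, not our actual tooling: the URL `http://localhost:9200` and the helper name `delayed_timeout_request` are assumptions for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request

# Assumed cluster address; replace with the real endpoint.
ES_URL = "http://localhost:9200"


def delayed_timeout_request(timeout: str, index: str = "_all") -> urllib.request.Request:
    """Build (but do not send) a PUT <index>/_settings request that sets
    how long Elasticsearch waits before reallocating replicas of a departed node."""
    body = {
        "settings": {
            "index.unassigned.node_left.delayed_timeout": timeout
        }
    }
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = delayed_timeout_request("5m")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# urllib.request.urlopen(req) would send the request to a live cluster.
```

Calling `delayed_timeout_request("8h")` instead would produce the blunt "wait for a tech" variant I'd like to avoid. Note that the setting is per index and applies the same way regardless of how many nodes left at once.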
Is it possible to have the above setting applied differently for multiple node failures?
Is there some alternative, for the scenario I described, that might be more elegant than setting the delayed allocation timeout to something like "8 hours" so that a tech can evaluate a node failure before the cluster replicates itself into another outage?