Handling loss of master quorum in ES7+

I am planning to upgrade an ES 6.3 cluster to 7.2 soon. In ES 6.3, if we lose a quorum of master nodes we can bring up new nodes that will join the cluster and restore the quorum, but from ES 7 onwards new master nodes cannot join a cluster that has lost its quorum, so the cluster is permanently impaired. While we never kill more than one node at a time, we would still like to be able to handle a loss of master quorum due to unforeseen circumstances or hardware issues, without bringing up a new cluster and restoring a snapshot. Do you have any recommendations for fixing an ES 7 cluster after the loss of a quorum of masters?
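(As an illustration of the failure mode under discussion, a minimal "is there still an elected master?" probe might look like the following. This is only a sketch, assuming the official elasticsearch-py 7.x client and a placeholder localhost endpoint.)

```python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import TransportError

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

def master_status():
    """Report whether the cluster currently has an elected master."""
    try:
        master = es.cat.master(format="json")  # lists the elected master, if any
        health = es.cluster.health()
        return {
            "elected_master": master[0]["node"] if master else None,
            "cluster_status": health["status"],
            "nodes_in_cluster": health["number_of_nodes"],
        }
    except TransportError as err:
        # Without a quorum of master-eligible nodes the cluster cannot answer
        # cluster-level requests; this typically surfaces as a 503 /
        # master_not_discovered_exception rather than a degraded status.
        return {"elected_master": None, "error": str(err)}

print(master_status())
```

Once a quorum of master-eligible nodes is gone, a probe like this fails outright rather than returning a degraded status, because the cluster can no longer answer cluster-level requests.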

The process you describe does not actually work in 6.3 either; dangerously, it only appears to work in versions before 7.0. In fact it seriously risks the silent loss of some of your data, and this is a bug that was fixed in 7.0.

The only safe path forward after permanently losing a majority of your master-eligible nodes is to restore from a snapshot. It'd have to be a pretty big disaster to permanently lose more than one node at once, wouldn't it? As long as their disks are still readable, you can recover by moving those disks to new machines.
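(For completeness, restoring the most recent good snapshot onto a replacement cluster might look roughly like the sketch below. It assumes the elasticsearch-py 7.x client, a placeholder endpoint, and a hypothetical repository name `backup_repo`.)

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # replacement cluster (placeholder endpoint)

# Find the most recent successful snapshot in the repository.
snapshots = es.snapshot.get(repository="backup_repo", snapshot="_all")["snapshots"]
latest = max(
    (s for s in snapshots if s["state"] == "SUCCESS"),
    key=lambda s: s["start_time_in_millis"],
)

# Restore all indices from that snapshot; wait_for_completion blocks until the
# restore operation has finished rather than returning as soon as it starts.
es.snapshot.restore(
    repository="backup_repo",
    snapshot=latest["snapshot"],
    body={"indices": "*", "include_global_state": False},
    wait_for_completion=True,
)
```

Whether to restore the global cluster state as well depends on the situation; the sketch leaves it out.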

Thanks David. Our Elasticsearch cluster is hosted on AWS, so recovering by moving disks is not straightforward, and the time it takes to restore a snapshot does not meet our availability requirements. Not to mention that we have some clusters where we can't serve data from snapshots that are even a few hours old. We might have to go with moving the disks for now, but do let us know if another solution pops up.

If you need to tolerate the loss of two or more nodes, you will need at least five master-eligible nodes so that the remaining ones still form a quorum. In a cloud environment it can work well to temporarily grow the cluster for extra resilience during maintenance and then shrink it again afterwards.
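(As a sanity check while growing or shrinking, one can read the current voting configuration from the cluster state to see how many master-eligible nodes can be lost while a majority survives. This is a sketch assuming the elasticsearch-py 7.x client, a placeholder endpoint, and the 7.x cluster state response layout.)

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

# The voting configuration lives in the metadata section of the cluster state.
state = es.cluster.state(metric="metadata")
voting = state["metadata"]["cluster_coordination"]["last_committed_config"]

total = len(voting)
tolerable_losses = (total - 1) // 2  # a majority must remain
print(f"{total} node(s) in the voting configuration; "
      f"can tolerate losing {tolerable_losses} of them")
```

With five nodes in the voting configuration this reports a tolerance of two, which covers the "loss of two or more nodes" case described above.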

While our problem is more about unforeseen issues than planned maintenance, I'll keep this in mind. Thanks!

If you are running on AWS, have you made sure you are using three availability zones with a master-eligible node in each? I have never seen the level of unreliability you describe except when using spot instances. How have you deployed your cluster?
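(To double-check that layout, one might group master-eligible nodes by the zone attribute each one reports. This is a sketch assuming the elasticsearch-py 7.x client, a placeholder endpoint, and a hypothetical node attribute named `aws_availability_zone` set via `node.attr.*` in elasticsearch.yml.)

```python
from collections import defaultdict
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

# In 7.x cat output, node.role includes "m" for master-eligible nodes.
roles = {
    n["name"]: n["node.role"]
    for n in es.cat.nodes(format="json", h="name,node.role")
}

# Group master-eligible nodes by the zone attribute they report.
masters_by_zone = defaultdict(list)
for attr in es.cat.nodeattrs(format="json"):
    if attr["attr"] == "aws_availability_zone" and "m" in roles.get(attr["node"], ""):
        masters_by_zone[attr["value"]].append(attr["node"])

for zone, nodes in sorted(masters_by_zone.items()):
    print(zone, nodes)
```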

We have masters in different AZs and are not using spot instances. The level of unreliability we want to handle is not just due to AWS but due to the many systems and tools, maintained by several different teams, that interact with each other. A single bad push somewhere can cause ripple effects, and more safety and a lower MTTR are always good to have.
