Handling loss of master quorum in ES7+

I am planning to upgrade an ES 6.3 cluster to 7.2 soon. In ES 6.3, if we lose a quorum of master nodes we can bring up new nodes that will join the cluster and restore the quorum, but from ES 7 onwards new master nodes cannot join a cluster that has lost its quorum, so the cluster is permanently impaired. While we never kill more than one node at a time, we would still like to be able to handle a loss of master quorum due to unforeseen circumstances or hardware issues, without bringing up a new cluster and restoring a snapshot. Do you have any recommendations for fixing an ES 7 cluster after the loss of a quorum of masters?
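(As an illustration of the failure mode under discussion, a minimal "is there still an elected master?" probe might look like the following. This is only a sketch, assuming the official elasticsearch-py 7.x client and a placeholder localhost endpoint.)

```python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import TransportError

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

def master_status():
    """Report whether the cluster currently has an elected master."""
    try:
        master = es.cat.master(format="json")  # lists the elected master, if any
        health = es.cluster.health()
        return {
            "elected_master": master[0]["node"] if master else None,
            "cluster_status": health["status"],
            "nodes_in_cluster": health["number_of_nodes"],
        }
    except TransportError as err:
        # Without a quorum of master-eligible nodes the cluster cannot answer
        # cluster-level requests; this typically surfaces as a 503 /
        # master_not_discovered_exception rather than a degraded status.
        return {"elected_master": None, "error": str(err)}

print(master_status())
```

Once a quorum of master-eligible nodes is gone, a probe like this fails outright rather than returning a degraded status, because the cluster can no longer answer cluster-level requests.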

The process you describe does not actually work in 6.3 either; dangerously, it only appears to work in versions before 7.0. In fact it seriously risks the silent loss of some of your data, and this is a bug that was fixed in 7.0.

The only safe path forward after permanently losing a majority of your master-eligible nodes is to restore from a snapshot. It'd have to be a pretty big disaster to permanently lose more than one node at once, wouldn't it? As long as their disks are still readable, you can recover by moving those disks to new machines.
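(For completeness, restoring the most recent good snapshot onto a replacement cluster might look roughly like the sketch below. It assumes the elasticsearch-py 7.x client, a placeholder endpoint, and a hypothetical repository name `backup_repo`.)

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # replacement cluster (placeholder endpoint)

# Find the most recent successful snapshot in the repository.
snapshots = es.snapshot.get(repository="backup_repo", snapshot="_all")["snapshots"]
latest = max(
    (s for s in snapshots if s["state"] == "SUCCESS"),
    key=lambda s: s["start_time_in_millis"],
)

# Restore all indices from that snapshot; wait_for_completion blocks until the
# restore operation has finished rather than returning as soon as it starts.
es.snapshot.restore(
    repository="backup_repo",
    snapshot=latest["snapshot"],
    body={"indices": "*", "include_global_state": False},
    wait_for_completion=True,
)
```

Whether to restore the global cluster state as well depends on the situation; the sketch leaves it out.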

Thanks David. Our Elasticsearch cluster is hosted on AWS, so recovering by moving disks is not straightforward, and the time it takes to restore a snapshot does not meet our availability requirements. Not to mention that we have some clusters where we can't serve data from snapshots that are even a few hours old. We might have to go with moving the disks for now, but do let us know if another solution pops up.

If you need to tolerate the loss of two or more nodes, you will need at least five master-eligible nodes so that the remaining ones still form a quorum. In a cloud environment it can work well to temporarily grow the cluster for extra resilience during maintenance and then shrink it again afterwards.
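(As a sanity check while growing or shrinking, one can read the current voting configuration from the cluster state to see how many master-eligible nodes can be lost while a majority survives. This is a sketch assuming the elasticsearch-py 7.x client, a placeholder endpoint, and the 7.x cluster state response layout.)

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

# The voting configuration lives in the metadata section of the cluster state.
state = es.cluster.state(metric="metadata")
voting = state["metadata"]["cluster_coordination"]["last_committed_config"]

total = len(voting)
tolerable_losses = (total - 1) // 2  # a majority must remain
print(f"{total} node(s) in the voting configuration; "
      f"can tolerate losing {tolerable_losses} of them")
```

With five nodes in the voting configuration this reports a tolerance of two, which covers the "loss of two or more nodes" case described above.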

While our problem is more about unforeseen issues than planned maintenance, I'll keep this in mind. Thanks!

If you are running on AWS, have you made sure you are using three availability zones with a master-eligible node in each? I have never seen the level of unreliability you describe except when using spot instances. How have you deployed your cluster?
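(To double-check that layout, one might group master-eligible nodes by the zone attribute each one reports. This is a sketch assuming the elasticsearch-py 7.x client, a placeholder endpoint, and a hypothetical node attribute named `aws_availability_zone` set via `node.attr.*` in elasticsearch.yml.)

```python
from collections import defaultdict
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

# In 7.x cat output, node.role includes "m" for master-eligible nodes.
roles = {
    n["name"]: n["node.role"]
    for n in es.cat.nodes(format="json", h="name,node.role")
}

# Group master-eligible nodes by the zone attribute they report.
masters_by_zone = defaultdict(list)
for attr in es.cat.nodeattrs(format="json"):
    if attr["attr"] == "aws_availability_zone" and "m" in roles.get(attr["node"], ""):
        masters_by_zone[attr["value"]].append(attr["node"])

for zone, nodes in sorted(masters_by_zone.items()):
    print(zone, nodes)
```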

We have masters in different AZs and are not using spot instances. The level of unreliability we want to handle is not just due to AWS but due to the many systems and tools, maintained by several different teams, that interact with each other. A single bad push somewhere can cause ripple effects, and more safety and a lower MTTR are always good to have.
