Managing loss of master quorum from simultaneous restart of nodes

Similar in nature to this:

In 7.x I have a scenario where quorum is lost after all of the nodes are restarted at the same time. All the data is still present on disk, so the cluster state itself should be intact.

What's the recommended way to work around this? The master-eligible nodes are constantly trying to join each other but failing:

[2020-03-26T05:08:41,583][INFO ][o.e.c.c.JoinHelper       ] [elasticsearch-es-master-1] failed to join {elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot} with JoinRequest{sourceNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}, optionalJoin=Optional[Join{term=94, lastAcceptedTerm=93, lastAcceptedVersion=5099765, sourceNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}, targetNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}}]}
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-master-1][10.244.21.61:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 95 while handling publication
	at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1012) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.2.0.jar:7.2.0]

It seems they can't establish quorum; the elections just keep churning and nothing sticks.
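Side note in case it helps anyone debugging something similar: to get more detail on why the joins keep failing, you can raise the log level for the coordination package that's emitting the messages above. A sketch only, set in elasticsearch.yml on the master-eligible nodes (the dynamic settings API needs an elected master, which is exactly what's missing here):

```yaml
# elasticsearch.yml - temporary while troubleshooting, remove afterwards.
# Enables verbose logging for the o.e.c.c.* (cluster coordination) classes
# shown in the join failures above.
logger.org.elasticsearch.cluster.coordination: DEBUG
```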

Okay, it looks like this was simply because we have slow disks and the join/publish timeouts were constantly being exceeded. Increasing cluster.join.timeout to around 300 seconds solved it.
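For reference, cluster.join.timeout is (as far as I can tell) a static setting, so it goes into elasticsearch.yml on each master-eligible node and only takes effect after a restart. Something like:

```yaml
# elasticsearch.yml on each master-eligible node (static setting, needs a node restart).
# 300s is what worked here on slow disks; the default is much lower, so slow storage
# can easily blow past it during a full-cluster restart.
cluster.join.timeout: 300s
```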

If that is the case, I would recommend using more performant storage to improve stability and availability. What kind of storage are you using?

I don't remember exactly which HDDs we're using because we're on a block storage platform. That said, we do have SSD capacity, so I can migrate the masters to SSDs; it's the sensible thing to do anyway.
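If this is running under ECK (the elasticsearch-es-master-* node names suggest it), the migration would roughly mean pointing the master nodeSet at an SSD-backed storage class. This is a sketch only; the fast-ssd storage class name, nodeSet name, counts and sizes are placeholders, not the real setup:

```yaml
# Hypothetical ECK spec with dedicated masters on an SSD-backed StorageClass.
# "fast-ssd", "master-ssd", the counts and the sizes are placeholders for illustration.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 7.2.0
  nodeSets:
    - name: master-ssd
      count: 3
      config:
        node.master: true
        node.data: false
        node.ingest: false
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data   # claim name ECK expects for the data path
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 20Gi
            storageClassName: fast-ssd
```

As far as I remember, ECK treats volumeClaimTemplates as immutable, so you can't change the storage class of an existing nodeSet in place; the usual approach is to add a new nodeSet like the one above and let the operator migrate away from the old one.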

@Dandy you may be interested in this video where I compare various storage options for Elasticsearch...

I tested 1xNVMe, 1xSSD, 2xSSD (RAID-0), 1xHDD, 4xHDD (RAID-0)

TL;DR - HDDs are HORRIBLE (even local multi-spindle RAID-optimized). NVMe isn't the best option like you might expect. Multi-SSD (RAID-0 - SATA/SAS) is the way to go.

[Chart: 0001_es_storage (storage benchmark results)]

Rob

How to install Elasticsearch & Kibana on Ubuntu - incl. hardware recommendations
What is the best storage technology for Elasticsearch?

It might also be worth upgrading to 7.6 or later, since there have been recent reductions in the I/O needs of master-eligible nodes that should make them behave much better on slower disks.