Managing loss of master quorum from simultaneous restart of nodes

Similar in nature to this:

In 7.x I have a scenario where quorum is lost after all of the nodes are restarted at the same time. All the data is still present on disk, so the cluster state itself should be intact.

What's the recommended way to work around this? The master-eligible nodes are constantly trying to join each other but failing:

[2020-03-26T05:08:41,583][INFO ][o.e.c.c.JoinHelper       ] [elasticsearch-es-master-1] failed to join {elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot} with JoinRequest{sourceNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}, optionalJoin=Optional[Join{term=94, lastAcceptedTerm=93, lastAcceptedVersion=5099765, sourceNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}, targetNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}}]}
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-master-1][10.244.21.61:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 95 while handling publication
	at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1012) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.2.0.jar:7.2.0]

It seems they can't establish quorum; the elections just keep churning and nothing sticks.
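Side note in case it helps anyone debugging something similar: to get more detail on why the joins keep failing, you can raise the log level for the coordination package that's emitting the messages above. A sketch only, set in elasticsearch.yml on the master-eligible nodes (the dynamic settings API needs an elected master, which is exactly what's missing here):

```yaml
# elasticsearch.yml - temporary while troubleshooting, remove afterwards.
# Enables verbose logging for the o.e.c.c.* (cluster coordination) classes
# shown in the join failures above.
logger.org.elasticsearch.cluster.coordination: DEBUG
```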

Okay, it looks like this was simply because we have slow disks and the join/publish timeouts were constantly being exceeded. Increasing cluster.join.timeout to around 300 seconds solved it.
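For reference, cluster.join.timeout is (as far as I can tell) a static setting, so it goes into elasticsearch.yml on each master-eligible node and only takes effect after a restart. Something like:

```yaml
# elasticsearch.yml on each master-eligible node (static setting, needs a node restart).
# 300s is what worked here on slow disks; the default is much lower, so slow storage
# can easily blow past it during a full-cluster restart.
cluster.join.timeout: 300s
```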

If that is the case, I would recommend using more performant storage to improve stability and availability. What kind of storage are you using?

I don't remember exactly which HDDs we're using because we're on a block storage platform. That said, we do have SSD capacity, so I can migrate the masters to SSDs; it's the sensible thing to do anyway.
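If this is running under ECK (the elasticsearch-es-master-* node names suggest it), the migration would roughly mean pointing the master nodeSet at an SSD-backed storage class. This is a sketch only; the fast-ssd storage class name, nodeSet name, counts and sizes are placeholders, not the real setup:

```yaml
# Hypothetical ECK spec with dedicated masters on an SSD-backed StorageClass.
# "fast-ssd", "master-ssd", the counts and the sizes are placeholders for illustration.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 7.2.0
  nodeSets:
    - name: master-ssd
      count: 3
      config:
        node.master: true
        node.data: false
        node.ingest: false
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data   # claim name ECK expects for the data path
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 20Gi
            storageClassName: fast-ssd
```

As far as I remember, ECK treats volumeClaimTemplates as immutable, so you can't change the storage class of an existing nodeSet in place; the usual approach is to add a new nodeSet like the one above and let the operator migrate away from the old one.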

@Dandy you may be interested in this video where I compare various storage options for Elasticsearch...

I tested 1xNVMe, 1xSSD, 2xSSD (RAID-0), 1xHDD, 4xHDD (RAID-0)

TL;DR - HDDs are HORRIBLE (even local multi-spindle RAID-optimized). NVMe isn't the best option like you might expect. Multi-SSD (RAID-0 - SATA/SAS) is the way to go.

[Chart: 0001_es_storage (storage benchmark results)]

Rob

How to install Elasticsearch & Kibana on Ubuntu - incl. hardware recommendations
What is the best storage technology for Elasticsearch?

It might also be worth upgrading to 7.6 or later, since there have been recent reductions in the I/O needs of master-eligible nodes that should make them behave much better on slower disks.