Managing loss of master quorum from simultaneous restart of nodes

Similar in nature to this:

In 7.x I have a scenario where quorum is lost because all of the nodes were restarted at the same time. All the data is still present, so the cluster state itself should be fine.

What's the recommended way to work around this? The masters are constantly trying to join each other but failing:

[2020-03-26T05:08:41,583][INFO ][o.e.c.c.JoinHelper       ] [elasticsearch-es-master-1] failed to join {elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot} with JoinRequest{sourceNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}, optionalJoin=Optional[Join{term=94, lastAcceptedTerm=93, lastAcceptedVersion=5099765, sourceNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}, targetNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}}]}
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-master-1][10.244.21.61:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 95 while handling publication
	at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1012) ~[elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.0.jar:7.2.0]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.2.0.jar:7.2.0]

It seems they can't establish quorum; the elections appear to keep restarting without ever settling on a master.
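
For reference, a few stock Elasticsearch endpoints are useful for watching where the election stands from each node (this sketch assumes the default http://localhost:9200 with no authentication; adjust for your setup):

curl -s 'http://localhost:9200/_cat/master?v'            # reports the elected master, or errors if there is none yet
curl -s 'http://localhost:9200/_cat/nodes?v'             # lists the nodes that have joined, with the elected master starred
curl -s 'http://localhost:9200/_cluster/health?pretty'   # returns master_not_discovered_exception while no master is elected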

Okay, it looks like this is simply because we have slow disks and the join/publish timeouts are constantly being exceeded. Increasing cluster.join.timeout to roughly 300 seconds solved it.
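
For anyone else landing here, cluster.join.timeout is a static setting, so it goes into elasticsearch.yml on each master-eligible node and requires a node restart to take effect. A minimal sketch of the change described above (300s is simply the value that worked for this cluster; the default in 7.x is 60s):

# elasticsearch.yml on each master-eligible node
cluster.join.timeout: 300s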

If that is the case, I would recommend using more performant storage to improve stability and availability. What kind of storage are you using?

I don't remember exactly which HDDs we're using because we're on a block storage platform. That being said, we do have SSD capacity available, so I can migrate the masters to SSDs; it's the sensible thing to do anyway.

@Dandy you may be interested in this video where I compare various storage options for Elasticsearch...

I tested 1xNVMe, 1xSSD, 2xSSD (RAID-0), 1xHDD, 4xHDD (RAID-0)

TL;DR - HDDs are HORRIBLE (even local multi-spindle RAID-optimized arrays). NVMe isn't the clear winner you might expect. Multi-SSD (RAID-0, SATA/SAS) is the way to go.

(image: 0001_es_storage - storage benchmark comparison chart)

Rob

How to install Elasticsearch & Kibana on Ubuntu - incl. hardware recommendations
What is the best storage technology for Elasticsearch?


It might also be worth upgrading to 7.6 or later, since there have been some recent reductions in the I/O needs of master-eligible nodes that should make them behave much better on slower disks.
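
If you do upgrade, a quick way to confirm which version each node is running (again assuming a plain http://localhost:9200 endpoint) is:

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,version,node.role,master'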

Thank you @rcowart. Sorry for the delay in responding; this is very much appreciated.
