In 7.x I have a scenario where quorum was lost because all the nodes were restarted at the same time. All the data is still present on disk, so the cluster state itself is intact.
What's the way to work around this? The masters are constantly trying to connect to each other but failing:
[2020-03-26T05:08:41,583][INFO ][o.e.c.c.JoinHelper ] [elasticsearch-es-master-1] failed to join {elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot} with JoinRequest{sourceNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}, optionalJoin=Optional[Join{term=94, lastAcceptedTerm=93, lastAcceptedVersion=5099765, sourceNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}, targetNode={elasticsearch-es-master-1}{KG71sIO_TAOPHDCnwVdxRw}{8r3SI6OlRPKF3jfr5A81zw}{10.244.21.61}{10.244.21.61:9300}{box_type=hot}}]}
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-es-master-1][10.244.21.61:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 95 while handling publication
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1012) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.2.0.jar:7.2.0]
It seems they can't establish a quorum and keep bouncing between election terms right now.
Okay, it looks like this is simply because we have slow disks and the join/publish timeout is constantly being exceeded. Increasing cluster.join.timeout to about 300 seconds solved it.
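For anyone hitting the same thing, here is a minimal sketch of the change, assuming it goes into elasticsearch.yml on each master-eligible node (it is a static setting, so it only takes effect after a restart). The related cluster.publish.timeout can be raised the same way if cluster state publication also times out:

```yaml
# elasticsearch.yml on each master-eligible node (static settings, applied on restart)
cluster.join.timeout: 300s       # default is 60s; raised because slow disks were delaying joins
# cluster.publish.timeout: 60s   # optional: also raise if cluster state publication times out (default 30s)
```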
If that is the case I would recommend using more performant storage in order to improve stability and availability. What kind of storage are you using?
I don't remember exactly which HDDs we're using because we're on a block storage platform. That said, we have SSD capacity available, so I can migrate the masters to SSDs; it's just the sensible thing to do.
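In case it helps anyone else on Kubernetes: judging by the node names (elasticsearch-es-master-1), this looks like an ECK-managed cluster, so moving the masters onto SSD-backed volumes could look roughly like the sketch below. The nodeSet name, count, storage size, and the ssd storage class name are assumptions for illustration; renaming the nodeSet makes the operator bring up new SSD-backed master pods and retire the old ones.

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 7.6.2                   # or whichever 7.x version you are running
  nodeSets:
  - name: master-ssd               # new nodeSet name so the operator replaces the old masters
    count: 3
    config:
      node.master: true
      node.data: false
      cluster.join.timeout: 300s   # the workaround from above, until storage is faster
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data   # ECK's default data volume name
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi          # assumed size
        storageClassName: ssd      # assumed name of the SSD-backed storage class
```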
@Dandy you may be interested in this video where I compare various storage options for Elasticsearch...
I tested 1xNVMe, 1xSSD, 2xSSD (RAID-0), 1xHDD, 4xHDD (RAID-0)
TL;DR - HDDs are HORRIBLE (even local multi-spindle RAID-optimized). NVMe isn't the best option as you might expect. Multi-SSD (RAID-0, SATA/SAS) is the way to go.
It might also be worth upgrading to 7.6 or later, since there have been some recent reductions in the I/O needs of master-eligible nodes that should make them work much better on slower disks.