Elasticsearch version: 7.0.1
When my cluster was under heavy load, one or two data nodes were continuously timing out requests, and I then saw the following error messages. This caused the elected master node to fail and brought down the whole cluster.
{"log":"org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s\n","stream":"stdout","time":"2019-08-22T00:02:34.180925405Z"}
...
{"log":"[2019-08-22T00:02:36,358][ERROR][o.e.c.c.Coordinator ] [dc17-esmaster-02] unexpected failure during [node-left]\n","stream":"stdout","time":"2019-08-22T00:02:36.359512415Z"}
{"log":"org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed\n","stream":"stdout","time":"2019-08-22T00:02:36.359540571Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1350) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359544582Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359548088Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359551174Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359554333Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359557356Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1290) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359567206Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359571008Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:88) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359574024Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$2.run(Coordinator.java:1257) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359576978Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.35957989Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]\n","stream":"stdout","time":"2019-08-22T00:02:36.359582883Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]\n","stream":"stdout","time":"2019-08-22T00:02:36.359585555Z"}
{"log":"\u0009at java.lang.Thread.run(Thread.java:835) [?:?]\n","stream":"stdout","time":"2019-08-22T00:02:36.35958833Z"}
{"log":"Caused by: org.elasticsearch.ElasticsearchException: publication cancelled before committing: timed out after 30s\n","stream":"stdout","time":"2019-08-22T00:02:36.359590919Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:85) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.35959374Z"}
{"log":"\u0009... 5 more\n","stream":"stdout","time":"2019-08-22T00:02:36.359596521Z"}
{"log":"[2019-08-22T00:02:36,360][ERROR][o.e.x.s.a.TokenService ] [dc17-esmaster-02] unable to install token metadata\n","stream":"stdout","time":"2019-08-22T00:02:36.360686359Z"}
{"log":"org.elasticsearch.cluster.NotMasterException: no longer master. source: [install-token-metadata]\n","stream":"stdout","time":"2019-08-22T00:02:36.360698367Z"}
Other than adding more hardware, I am not sure whether any configuration tuning can prevent this from happening.
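For what it's worth, the "timed out after 30s" in the `FailedToCommitClusterStateException` matches the default of the `cluster.publish.timeout` setting in 7.x. Below is a sketch of settings that *might* mitigate the symptom by making coordination more tolerant of slow data nodes; the specific values are illustrative assumptions, not a confirmed fix:

```yaml
# elasticsearch.yml -- illustrative values, not a verified remedy.

# Give slow data nodes more time to acknowledge cluster-state publications
# (default is 30s, which matches the timeout seen in the stack trace above).
cluster.publish.timeout: 60s

# Let the master tolerate briefly unresponsive followers longer before it
# starts a node-left publication (defaults: 10s timeout, 3 retries).
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.retry_count: 6

# On the master-eligible nodes only: dedicate them to coordination so that
# heavy indexing/search load on data nodes cannot starve the master itself.
node.master: true
node.data: false
node.ingest: false
```

Raising timeouts only papers over overload, so dedicated master-eligible nodes (the last three lines) are probably the more robust part of this sketch.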