One stalled data node causes the master to fail

Elasticsearch version: 7.0.1

For some reason, while my cluster was under heavy load, one or two data nodes kept timing out on requests. I then saw the following error messages. This caused the elected master node to fail and brought down the whole cluster.

{"log":"org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s\n","stream":"stdout","time":"2019-08-22T00:02:34.180925405Z"}   
...
{"log":"[2019-08-22T00:02:36,358][ERROR][o.e.c.c.Coordinator      ] [dc17-esmaster-02] unexpected failure during [node-left]\n","stream":"stdout","time":"2019-08-22T00:02:36.359512415Z"}
{"log":"org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed\n","stream":"stdout","time":"2019-08-22T00:02:36.359540571Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1350) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359544582Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359548088Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359551174Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359554333Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359557356Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1290) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359567206Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359571008Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:88) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359574024Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$2.run(Coordinator.java:1257) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.359576978Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.35957989Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]\n","stream":"stdout","time":"2019-08-22T00:02:36.359582883Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]\n","stream":"stdout","time":"2019-08-22T00:02:36.359585555Z"}
{"log":"\u0009at java.lang.Thread.run(Thread.java:835) [?:?]\n","stream":"stdout","time":"2019-08-22T00:02:36.35958833Z"}
{"log":"Caused by: org.elasticsearch.ElasticsearchException: publication cancelled before committing: timed out after 30s\n","stream":"stdout","time":"2019-08-22T00:02:36.359590919Z"}
{"log":"\u0009at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:85) ~[elasticsearch-7.0.1.jar:7.0.1]\n","stream":"stdout","time":"2019-08-22T00:02:36.35959374Z"}
{"log":"\u0009... 5 more\n","stream":"stdout","time":"2019-08-22T00:02:36.359596521Z"}
{"log":"[2019-08-22T00:02:36,360][ERROR][o.e.x.s.a.TokenService   ] [dc17-esmaster-02] unable to install token metadata\n","stream":"stdout","time":"2019-08-22T00:02:36.360686359Z"}
{"log":"org.elasticsearch.cluster.NotMasterException: no longer master. source: [install-token-metadata]\n","stream":"stdout","time":"2019-08-22T00:02:36.360698367Z"}

Other than adding more hardware, I'm not sure whether any configuration tuning can prevent this from happening.
