We are running Elasticsearch 7.2.1.
The elected master (node-2) had a long GC pause:
[2020-06-30T07:30:11.778+0000][3665][gc ] GC(248) Pause Young (Allocation Failure) 222M->86M(494M) 32547.255ms
[2020-06-30T07:30:11.778+0000][3665][gc,cpu ] GC(248) User=0.01s Sys=0.00s Real=32.54s
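To spot pauses like this one systematically, we use a small helper script (our own, not part of Elasticsearch) that scans unified JVM GC log lines (`-Xlog:gc`) and flags pauses longer than our 5s fault-detection timeout:

```python
import re

# Matches lines like:
#   GC(248) Pause Young (Allocation Failure) 222M->86M(494M) 32547.255ms
PAUSE_RE = re.compile(r"GC\((\d+)\) Pause (\w+).*?(\d+(?:\.\d+)?)ms$")

def long_pauses(lines, threshold_ms=5000.0):
    """Return (gc_id, pause_type, duration_ms) for pauses over threshold_ms."""
    hits = []
    for line in lines:
        m = PAUSE_RE.search(line.strip())
        if m:
            gc_id, kind, ms = int(m.group(1)), m.group(2), float(m.group(3))
            if ms > threshold_ms:
                hits.append((gc_id, kind, ms))
    return hits
```

Running it over the excerpt above flags GC(248) at ~32.5s, far beyond any of the timeouts below.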
Afterwards, the other nodes could not elect a new master for a long time.
We have a 3-node cluster; all three nodes are master-eligible.
This is our configuration:
cluster.publish.timeout: 15s
cluster.fault_detection.leader_check.timeout: 5s
cluster.fault_detection.follower_check.timeout: 5s
cluster.follower_lag.timeout: 10s
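For context, rough arithmetic on how long a follower should take to declare the leader failed with these settings. This assumes the 7.x defaults for `cluster.fault_detection.leader_check.interval` (1s) and `cluster.fault_detection.leader_check.retry_count` (3), which we have not overridden:

```python
# Worst-case leader-failure detection time on a follower.
interval_s = 1.0   # leader_check.interval (7.x default, assumed unchanged)
timeout_s = 5.0    # our configured leader_check.timeout
retry_count = 3    # consecutive failures before the leader is declared failed
                   # (leader_check.retry_count, 7.x default, assumed unchanged)

# Each failed check costs up to (interval + timeout) before the next starts.
worst_case_s = retry_count * (interval_s + timeout_s)
print(worst_case_s)  # 18.0
```

That ~18s budget matches the logs: node-1's first leader check timed out around 07:29:44 and it dropped the master at 07:29:56.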
node-1
[2020-06-30T07:29:56,793][INFO ][o.e.c.s.ClusterApplierService] [node-1] master node changed {previous [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}], current []}, term: 60, version: 804, reason: becoming candidate: onLeaderFailure
[2020-06-30T07:30:11,779][WARN ][o.e.t.TransportService ] [node-1] Received response for a request that has timed out, sent [32009ms] ago, timed out [27007ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}], id [43286]
[2020-06-30T07:30:11,779][WARN ][o.e.t.TransportService ] [node-1] Received response for a request that has timed out, sent [26007ms] ago, timed out [21006ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}], id [43287]
[2020-06-30T07:30:11,780][WARN ][o.e.t.TransportService ] [node-1] Received response for a request that has timed out, sent [20005ms] ago, timed out [15004ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}], id [43288]
[2020-06-30T07:30:14,785][INFO ][o.e.c.c.JoinHelper ] [node-1] failed to join {node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={node-1}{wilUD2IFS3m_QVnltjhJxQ}{6A_nX4wKRTWzo08iu9_avQ}{node-1}{9.151.141.1:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=61, lastAcceptedTerm=60, lastAcceptedVersion=804, sourceNode={node-1}{wilUD2IFS3m_QVnltjhJxQ}{6A_nX4wKRTWzo08iu9_avQ}{node-1}{9.151.141.1:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}, targetNode={node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.RemoteTransportException: [node-3][9.151.141.3:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 62 while handling publication
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1012) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) ~[elasticsearch-7.2.1.jar:7.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
node-2
[2020-06-30T07:30:13,784][INFO ][o.e.c.s.MasterService ] [node-2] node-left[{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} followers check retry count exceeded], term: 60, version: 805, reason: removed {{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true},}
[2020-06-30T07:30:14,614][INFO ][o.e.c.s.ClusterApplierService] [node-2] master node changed {previous [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}], current []}, term: 60, version: 804, reason: becoming candidate: Publication.onCompletion(false)
[2020-06-30T07:30:14,614][WARN ][o.e.c.s.MasterService ] [node-2] failing [node-left[{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} followers check retry count exceeded]]: failed to commit cluster state version [805]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1353) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1293) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:172) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.access$600(Publication.java:41) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication$PublicationTarget.onFaultyNode(Publication.java:275) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.lambda$onFaultyNode$2(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
at org.elasticsearch.cluster.coordination.Publication.onFaultyNode(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.start(Publication.java:69) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1044) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.2.1.jar:7.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: non-failed nodes do not form a quorum
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:170) ~[elasticsearch-7.2.1.jar:7.2.1]
... 18 more
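If we read the "non-failed nodes do not form a quorum" cause correctly, it is plain majority arithmetic: with 3 master-eligible voting nodes, a publication must be acked by at least 2, and the stale master alone cannot commit once it considers both followers faulty. A toy illustration (our own, not Elasticsearch code):

```python
def majority(voting_nodes: int) -> int:
    """Smallest number of acks that forms a quorum among the voting nodes."""
    return voting_nodes // 2 + 1

# 3 master-eligible nodes: node-2 by itself (1 ack) cannot commit state 805.
print(majority(3))  # 2
```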
[2020-06-30T07:30:16,174][ERROR][o.e.c.c.Coordinator ] [node-2] unexpected failure during [node-left]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1353) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1293) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:172) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.access$600(Publication.java:41) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication$PublicationTarget.onFaultyNode(Publication.java:275) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.lambda$onFaultyNode$2(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
at org.elasticsearch.cluster.coordination.Publication.onFaultyNode(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.start(Publication.java:69) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1044) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.1.jar:7.2.1]