We are running Elasticsearch 7.2.1.
The elected master (node-2) had a long GC pause:
[2020-06-30T07:30:11.778+0000][3665][gc ] GC(248) Pause Young (Allocation Failure) 222M->86M(494M) 32547.255ms
[2020-06-30T07:30:11.778+0000][3665][gc,cpu ] GC(248) User=0.01s Sys=0.00s Real=32.54s
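To spot pauses like this one systematically, we use a small helper script (our own, not part of Elasticsearch) that scans unified JVM GC log lines (`-Xlog:gc`) and flags pauses longer than our 5s fault-detection timeout:

```python
import re

# Matches lines like:
#   GC(248) Pause Young (Allocation Failure) 222M->86M(494M) 32547.255ms
PAUSE_RE = re.compile(r"GC\((\d+)\) Pause (\w+).*?(\d+(?:\.\d+)?)ms$")

def long_pauses(lines, threshold_ms=5000.0):
    """Return (gc_id, pause_type, duration_ms) for pauses over threshold_ms."""
    hits = []
    for line in lines:
        m = PAUSE_RE.search(line.strip())
        if m:
            gc_id, kind, ms = int(m.group(1)), m.group(2), float(m.group(3))
            if ms > threshold_ms:
                hits.append((gc_id, kind, ms))
    return hits
```

Running it over the excerpt above flags GC(248) at ~32.5s, far beyond any of the timeouts below.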
Afterwards, the other nodes could not elect a new master for a long time.
We have a 3-node cluster; all three nodes are master-eligible.
This is our configuration:
cluster.publish.timeout: 15s
cluster.fault_detection.leader_check.timeout: 5s
cluster.fault_detection.follower_check.timeout: 5s
cluster.follower_lag.timeout: 10s
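For context, rough arithmetic on how long a follower should take to declare the leader failed with these settings. This assumes the 7.x defaults for `cluster.fault_detection.leader_check.interval` (1s) and `cluster.fault_detection.leader_check.retry_count` (3), which we have not overridden:

```python
# Worst-case leader-failure detection time on a follower.
interval_s = 1.0   # leader_check.interval (7.x default, assumed unchanged)
timeout_s = 5.0    # our configured leader_check.timeout
retry_count = 3    # consecutive failures before the leader is declared failed
                   # (leader_check.retry_count, 7.x default, assumed unchanged)

# Each failed check costs up to (interval + timeout) before the next starts.
worst_case_s = retry_count * (interval_s + timeout_s)
print(worst_case_s)  # 18.0
```

That ~18s budget matches the logs: node-1's first leader check timed out around 07:29:44 and it dropped the master at 07:29:56.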
node-1
[2020-06-30T07:29:56,793][INFO ][o.e.c.s.ClusterApplierService] [node-1] master node changed {previous [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}], current []}, term: 60, version: 804, reason: becoming candidate: onLeaderFailure
[2020-06-30T07:30:11,779][WARN ][o.e.t.TransportService ] [node-1] Received response for a request that has timed out, sent [32009ms] ago, timed out [27007ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}], id [43286]
[2020-06-30T07:30:11,779][WARN ][o.e.t.TransportService ] [node-1] Received response for a request that has timed out, sent [26007ms] ago, timed out [21006ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}], id [43287]
[2020-06-30T07:30:11,780][WARN ][o.e.t.TransportService ] [node-1] Received response for a request that has timed out, sent [20005ms] ago, timed out [15004ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}], id [43288]
[2020-06-30T07:30:14,785][INFO ][o.e.c.c.JoinHelper ] [node-1] failed to join {node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={node-1}{wilUD2IFS3m_QVnltjhJxQ}{6A_nX4wKRTWzo08iu9_avQ}{node-1}{9.151.141.1:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=61, lastAcceptedTerm=60, lastAcceptedVersion=804, sourceNode={node-1}{wilUD2IFS3m_QVnltjhJxQ}{6A_nX4wKRTWzo08iu9_avQ}{node-1}{9.151.141.1:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}, targetNode={node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.RemoteTransportException: [node-3][9.151.141.3:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 62 while handling publication
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1012) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) ~[elasticsearch-7.2.1.jar:7.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
node-2
[2020-06-30T07:30:13,784][INFO ][o.e.c.s.MasterService ] [node-2] node-left[{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} followers check retry count exceeded], term: 60, version: 805, reason: removed {{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true},}
[2020-06-30T07:30:14,614][INFO ][o.e.c.s.ClusterApplierService] [node-2] master node changed {previous [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}], current []}, term: 60, version: 804, reason: becoming candidate: Publication.onCompletion(false)
[2020-06-30T07:30:14,614][WARN ][o.e.c.s.MasterService ] [node-2] failing [node-left[{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} followers check retry count exceeded]]: failed to commit cluster state version [805]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1353) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1293) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:172) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.access$600(Publication.java:41) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication$PublicationTarget.onFaultyNode(Publication.java:275) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.lambda$onFaultyNode$2(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
at org.elasticsearch.cluster.coordination.Publication.onFaultyNode(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.start(Publication.java:69) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1044) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.2.1.jar:7.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: non-failed nodes do not form a quorum
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:170) ~[elasticsearch-7.2.1.jar:7.2.1]
... 18 more
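If we read the "non-failed nodes do not form a quorum" cause correctly, it is plain majority arithmetic: with 3 master-eligible voting nodes, a publication must be acked by at least 2, and the stale master alone cannot commit once it considers both followers faulty. A toy illustration (our own, not Elasticsearch code):

```python
def majority(voting_nodes: int) -> int:
    """Smallest number of acks that forms a quorum among the voting nodes."""
    return voting_nodes // 2 + 1

# 3 master-eligible nodes: node-2 by itself (1 ack) cannot commit state 805.
print(majority(3))  # 2
```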
[2020-06-30T07:30:16,174][ERROR][o.e.c.c.Coordinator ] [node-2] unexpected failure during [node-left]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1353) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1293) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:172) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.access$600(Publication.java:41) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication$PublicationTarget.onFaultyNode(Publication.java:275) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.lambda$onFaultyNode$2(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
at org.elasticsearch.cluster.coordination.Publication.onFaultyNode(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.start(Publication.java:69) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1044) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.1.jar:7.2.1]