We are using Elasticsearch 7.2.1.
We have a three-node cluster, with all nodes being master-eligible.
Our configuration is:
transport.connect_timeout: 1s
cluster.publish.timeout: 15s
cluster.fault_detection.leader_check.timeout: 5s
cluster.fault_detection.follower_check.timeout: 5s
cluster.follower_lag.timeout: 10s
We changed the defaults of these timeouts because it took too long to remove a disconnected master from the cluster. (Maybe this was not a very good decision...)
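For anyone reproducing this: the effective values can be confirmed with the cluster settings API (standard Elasticsearch API, shown here in Dev Console syntax; the `filter_path` below just narrows the output to the settings we changed):

```
GET _cluster/settings?include_defaults=true&flat_settings=true&filter_path=*.transport.connect_timeout,*.cluster.publish.timeout,*.cluster.fault_detection.*,*.cluster.follower_lag.timeout
```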
The cluster had formed and Elasticsearch was working fine until node-2 hit a long GC pause:
[2020-06-30T07:30:11.778+0000][3665][gc ] GC(248) Pause Young (Allocation Failure) 222M->86M(494M) 32547.255ms
[2020-06-30T07:30:11.778+0000][3665][gc,cpu ] GC(248) User=0.01s Sys=0.00s Real=32.54s
We don't know why it took so long; it happens occasionally.
(We are aware that we set the heap size to only 512 MB.)
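For reference, the heap is set in `jvm.options`. With only 512 MB, an allocation spike can plausibly drive very long young-generation pauses. A sketch of a more generous setting (the 4 GB value is our illustration, not a recommendation from the logs; the nodes here report ~128 GB of RAM, so there is room):

```
# jvm.options -- keep min and max heap equal to avoid resize pauses
-Xms4g
-Xmx4g
```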
This pause triggered a chain reaction that broke the cluster, which then failed to re-form.
Logs from node-2:
[2020-06-30T07:30:13,784][INFO ][o.e.c.s.MasterService    ] [node-2] node-left[{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} followers check retry count exceeded], term: 60, version: 805, reason: removed {{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true},}
[2020-06-30T07:30:14,614][INFO ][o.e.c.s.ClusterApplierService] [node-2] master node changed {previous [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}], current []}, term: 60, version: 804, reason: becoming candidate: Publication.onCompletion(false)
[2020-06-30T07:30:14,614][WARN ][o.e.c.s.MasterService    ] [node-2] failing [node-left[{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} followers check retry count exceeded]]: failed to commit cluster state version [805]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1353) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1293) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:172) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.access$600(Publication.java:41) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication$PublicationTarget.onFaultyNode(Publication.java:275) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.lambda$onFaultyNode$2(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
at org.elasticsearch.cluster.coordination.Publication.onFaultyNode(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.start(Publication.java:69) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1044) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.2.1.jar:7.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: non-failed nodes do not form a quorum
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:170) ~[elasticsearch-7.2.1.jar:7.2.1]
... 18 more
[2020-06-30T07:30:16,174][ERROR][o.e.c.c.Coordinator ] [node-2] unexpected failure during [node-left]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1353) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1293) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:172) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.access$600(Publication.java:41) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication$PublicationTarget.onFaultyNode(Publication.java:275) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.lambda$onFaultyNode$2(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
at org.elasticsearch.cluster.coordination.Publication.onFaultyNode(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.start(Publication.java:69) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1044) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.1.jar:7.2.1]
...
[2020-06-30T07:30:41,612][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node-2] master not discovered or elected yet, an election requires at least 2 nodes with ids from [wilUD2IFS3m_QVnltjhJxQ, NfMKf8dURpi2FwKKCQj1RA, 3up-5-IMTjqggCmNsLXX_w], have discovered [{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}, {node-1}{wilUD2IFS3m_QVnltjhJxQ}{6A_nX4wKRTWzo08iu9_avQ}{node-1}{9.151.141.1:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [9.151.141.1:9300, 9.151.141.3:9300] from hosts providers and [{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}, {node-1}{wilUD2IFS3m_QVnltjhJxQ}{6A_nX4wKRTWzo08iu9_avQ}{node-1}{9.151.141.1:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}, {node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 64, last-accepted version 804 in term 60
[2020-06-30T07:30:56,788][WARN ][o.e.c.InternalClusterInfoService] [node-2] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
[2020-06-30T07:30:56,787][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node-2] failed to execute on node [3up-5-IMTjqggCmNsLXX_w]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-3][9.151.141.3:9300][cluster:monitor/nodes/stats[n]] request_id [121306] timed out after [14805ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1013) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.1.jar:7.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
[2020-06-30T07:31:11,788][WARN ][o.e.c.InternalClusterInfoService] [node-2] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2020-06-30T07:31:33,655][INFO ][o.e.c.c.JoinHelper ] [node-2] failed to join {node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=61, lastAcceptedTerm=60, lastAcceptedVersion=804, sourceNode={node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}, targetNode={node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-3][9.151.141.3:9300][internal:cluster/coordination/join] request_id [121209] timed out after [60017ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1013) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.1.jar:7.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]