We are using Elasticsearch 7.2.1.
We have a three-node cluster, with all nodes being master-eligible.
Our configuration is:
transport.connect_timeout: 1s
cluster.publish.timeout: 15s
cluster.fault_detection.leader_check.timeout: 5s
cluster.fault_detection.follower_check.timeout: 5s
cluster.follower_lag.timeout: 10s
We changed the defaults of these timeouts because it took too long to remove a disconnected master from the cluster. (Maybe this was not a very good decision...)
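For anyone reproducing this: the effective values can be confirmed with the cluster settings API (standard Elasticsearch API, shown here in Dev Console syntax; the `filter_path` below just narrows the output to the settings we changed):

```
GET _cluster/settings?include_defaults=true&flat_settings=true&filter_path=*.transport.connect_timeout,*.cluster.publish.timeout,*.cluster.fault_detection.*,*.cluster.follower_lag.timeout
```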
The cluster had formed and Elasticsearch was working fine until node-2 hit a long GC pause:
[2020-06-30T07:30:11.778+0000][3665][gc ] GC(248) Pause Young (Allocation Failure) 222M->86M(494M) 32547.255ms
[2020-06-30T07:30:11.778+0000][3665][gc,cpu ] GC(248) User=0.01s Sys=0.00s Real=32.54s
We don't know why it took so long; it happens occasionally.
(We are aware that we set the heap size to only 512 MB.)
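For reference, the heap is set in `jvm.options`. With only 512 MB, an allocation spike can plausibly drive very long young-generation pauses. A sketch of a more generous setting (the 4 GB value is our illustration, not a recommendation from the logs; the nodes here report ~128 GB of RAM, so there is room):

```
# jvm.options -- keep min and max heap equal to avoid resize pauses
-Xms4g
-Xmx4g
```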
This pause triggered a chain reaction that broke the cluster, which then failed to re-form.
Logs from node-2:
[2020-06-30T07:30:13,784][INFO ][o.e.c.s.MasterService    ] [node-2] node-left[{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} followers check retry count exceeded], term: 60, version: 805, reason: removed {{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true},}
[2020-06-30T07:30:14,614][INFO ][o.e.c.s.ClusterApplierService] [node-2] master node changed {previous [{node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}], current []}, term: 60, version: 804, reason: becoming candidate: Publication.onCompletion(false)
[2020-06-30T07:30:14,614][WARN ][o.e.c.s.MasterService    ] [node-2] failing [node-left[{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} followers check retry count exceeded]]: failed to commit cluster state version [805]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1353) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1293) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:172) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.access$600(Publication.java:41) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication$PublicationTarget.onFaultyNode(Publication.java:275) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.lambda$onFaultyNode$2(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
at org.elasticsearch.cluster.coordination.Publication.onFaultyNode(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.start(Publication.java:69) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1044) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.2.1.jar:7.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: non-failed nodes do not form a quorum
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:170) ~[elasticsearch-7.2.1.jar:7.2.1]
... 18 more
[2020-06-30T07:30:16,174][ERROR][o.e.c.c.Coordinator ] [node-2] unexpected failure during [node-left]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1353) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1293) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:172) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.access$600(Publication.java:41) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication$PublicationTarget.onFaultyNode(Publication.java:275) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.lambda$onFaultyNode$2(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
at org.elasticsearch.cluster.coordination.Publication.onFaultyNode(Publication.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Publication.start(Publication.java:69) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1044) ~[elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.2.1.jar:7.2.1]
...
[2020-06-30T07:30:41,612][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node-2] master not discovered or elected yet, an election requires at least 2 nodes with ids from [wilUD2IFS3m_QVnltjhJxQ, NfMKf8dURpi2FwKKCQj1RA, 3up-5-IMTjqggCmNsLXX_w], have discovered [{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}, {node-1}{wilUD2IFS3m_QVnltjhJxQ}{6A_nX4wKRTWzo08iu9_avQ}{node-1}{9.151.141.1:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [9.151.141.1:9300, 9.151.141.3:9300] from hosts providers and [{node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}, {node-1}{wilUD2IFS3m_QVnltjhJxQ}{6A_nX4wKRTWzo08iu9_avQ}{node-1}{9.151.141.1:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}, {node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 64, last-accepted version 804 in term 60
[2020-06-30T07:30:56,788][WARN ][o.e.c.InternalClusterInfoService] [node-2] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
[2020-06-30T07:30:56,787][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node-2] failed to execute on node [3up-5-IMTjqggCmNsLXX_w]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-3][9.151.141.3:9300][cluster:monitor/nodes/stats[n]] request_id [121306] timed out after [14805ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1013) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.1.jar:7.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
[2020-06-30T07:31:11,788][WARN ][o.e.c.InternalClusterInfoService] [node-2] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2020-06-30T07:31:33,655][INFO ][o.e.c.c.JoinHelper ] [node-2] failed to join {node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=61, lastAcceptedTerm=60, lastAcceptedVersion=804, sourceNode={node-2}{NfMKf8dURpi2FwKKCQj1RA}{wuB2XEu3TwCngO0RTZL8LA}{node-2}{9.151.141.2:9300}{ml.machine_memory=136798695424, xpack.installed=true, ml.max_open_jobs=20}, targetNode={node-3}{3up-5-IMTjqggCmNsLXX_w}{v5QlLSzETtSZTaIHLOPusA}{node-3}{9.151.141.3:9300}{ml.machine_memory=136798695424, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-3][9.151.141.3:9300][internal:cluster/coordination/join] request_id [121209] timed out after [60017ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1013) [elasticsearch-7.2.1.jar:7.2.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.1.jar:7.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]