I am deploying ELK 7.0.1 in a Kubernetes environment via Helm.
I have 3 master pods, 2 data pods, and 3 client pods.
efk-belk-elasticsearch-client-5f8455d6c5-wqggx
efk-belk-elasticsearch-data-0
efk-belk-elasticsearch-master-0
efk-belk-elasticsearch-master-1
efk-belk-elasticsearch-master-2
My master pods are part of a StatefulSet, and CLUSTER_INITIAL_MASTER_NODES is configured with the hostnames of all master-eligible pods.
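For reference, the master StatefulSet passes this as a plain env var, roughly like below (a minimal sketch; the actual chart template differs, but the pod names are as listed above):

env:
  - name: CLUSTER_INITIAL_MASTER_NODES
    # hostnames of all master-eligible pods in the StatefulSet
    value: "efk-belk-elasticsearch-master-0,efk-belk-elasticsearch-master-1,efk-belk-elasticsearch-master-2"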
On installation, the cluster forms fine.
But if I upgrade the chart in a way that causes pods to be deleted and recreated, the new master and data pods that come up fail to join the cluster.
One such scenario:
On cluster installation, master-2 was the elected master (output of _cat/nodes):
192.168.1.3 4 98 4 0.09 0.21 0.49 i - efk-belk-elasticsearch-client-5f8455d6c5-wqggx
192.168.1.13 10 98 4 0.09 0.21 0.49 mi * efk-belk-elasticsearch-master-2
192.168.1.221 11 95 6 0.27 0.28 0.63 mi - efk-belk-elasticsearch-master-1
192.168.1.30 11 95 5 0.25 0.62 0.83 mi - efk-belk-elasticsearch-master-0
192.168.1.59 5 95 5 0.25 0.62 0.83 di - efk-belk-elasticsearch-data-0
I changed the Elasticsearch service name and upgraded the chart. Since this name is passed to the master pods as an environment variable, the upgrade deleted the master pods to recreate them. Because they belong to a StatefulSet, the pods were rolled in reverse ordinal order, so the elected master (master-2) was deleted first.
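The env wiring in question looks roughly like the following (DISCOVERY_SERVICE and the value shown are illustrative; my chart uses its own key). Because the pod spec changes whenever this value changes, the StatefulSet has to recreate the pods:

env:
  - name: DISCOVERY_SERVICE           # illustrative key; the chart's actual env name differs
    value: "efk-belk-elasticsearch"   # the service name I changed during the upgrade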
After that, master-0 was elected as the master:
192.168.1.3 7 98 10 1.71 0.79 0.64 i - efk-belk-elasticsearch-client-5f8455d6c5-wqggx
192.168.1.13 mi - efk-belk-elasticsearch-master-2
192.168.1.221 12 97 7 0.19 0.19 0.49 mi - efk-belk-elasticsearch-master-1
192.168.1.30 9 89 7 0.29 0.39 0.67 mi * efk-belk-elasticsearch-master-0
192.168.1.59 di - efk-belk-elasticsearch-data-0
Then master-1 got deleted, and finally master-0 at the end, as the upgrade progressed. This results in the following exception when accessing the cluster:
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
Since then, the cluster has been completely unresponsive; even though all master pods have come back up, the cluster does not re-form.
I see these error logs in the master-1 pod:
"logger":"o.e.c.c.Coordinator","timezone":"UTC","marker":"[efk-belk-elasticsearch-master-1] ","log":"failed to validate incoming join request from node [{efk-belk-elasticsearch-client-5f8455d6c5-wqggx}{RNeNMC3HTZ2tGDFxf4OrQQ}{o_s3tSR5RveFYhomgAj_UQ}{192.168.1.3}{192.168.1.3:9300}]"}
org.elasticsearch.transport.RemoteTransportException: [efk-belk-elasticsearch-client-5f8455d6c5-wqggx][192.168.1.3:9300][internal:cluster/coordination/join/validate]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid TAiCOKecSwawqLc1WPOniQ than local cluster uuid slo-5iTvQm6N1cAHz4Vt2w, rejecting
at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$4(JoinHelper.java:147) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1077) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.0.1.jar:7.0.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
And this error in the master-2 pod:
{"type":"log","host":"efk-belk-elasticsearch-master-2","level":"WARN","systemid":"a6aab87097d14ff5b192d56f1d73ff1a","system":"BELK","time": "2019-07-04T00:47:58.853Z","logger":"o.e.c.c.ClusterFormationFailureHelper","timezone":"UTC","marker":"[efk-belk-elasticsearch-master-2] ","log":"master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [efk-belk-elasticsearch-master-0, efk-belk-elasticsearch-master-1, efk-belk-elasticsearch-master-2] to bootstrap a cluster: have discovered ; discovery will continue using [10.254.70.48:9300] from hosts providers and [{efk-belk-elasticsearch-master-2}{jkxPUt86RW-wOP0UcfC-IQ}{awau0gi6Sgm1WHYsOQhvgQ}{192.168.1.18}{192.168.1.18:9300}] from last-known cluster state; node term 0, last-accepted version 0 in term 0"}
Can you tell me what is causing this issue and how I can resolve it?
I also frequently see cluster-join issues with the data pods when pods are deleted and recreated.