Master not discovered exception with ELK 7

I am deploying ELK 7.0.1 in a Kubernetes environment via Helm.
I have 3 master pods, 2 data pods, and 3 client pods.

efk-belk-elasticsearch-client-5f8455d6c5-wqggx
efk-belk-elasticsearch-data-0
efk-belk-elasticsearch-master-0
efk-belk-elasticsearch-master-1
efk-belk-elasticsearch-master-2

My master pods are part of a StatefulSet, and CLUSTER_INITIAL_MASTER_NODES is configured with the hostnames of all master-eligible pods.
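
For context, this is roughly how that setting is wired into the master StatefulSet. The env block below is only a sketch based on my pod names; the chart's actual variable names and format may differ:

# Sketch of the relevant part of the master StatefulSet pod spec (approximate).
env:
  - name: node.name
    valueFrom:
      fieldRef:
        fieldPath: metadata.name        # pod hostname, e.g. efk-belk-elasticsearch-master-0
  - name: cluster.initial_master_nodes  # surfaced by the chart as CLUSTER_INITIAL_MASTER_NODES (assumption)
    value: "efk-belk-elasticsearch-master-0,efk-belk-elasticsearch-master-1,efk-belk-elasticsearch-master-2"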

On installation, the cluster works fine.
But if I upgrade something in the chart that causes pods to be deleted, the new master and data pods that come up fail to join the cluster.

One such scenario:
On cluster installation, master-2 was the elected master.

192.168.1.3 4 98 4 0.09 0.21 0.49 i - efk-belk-elasticsearch-client-5f8455d6c5-wqggx
192.168.1.13 10 98 4 0.09 0.21 0.49 mi * efk-belk-elasticsearch-master-2
192.168.1.221 11 95 6 0.27 0.28 0.63 mi - efk-belk-elasticsearch-master-1
192.168.1.30 11 95 5 0.25 0.62 0.83 mi - efk-belk-elasticsearch-master-0
192.168.1.59 5 95 5 0.25 0.62 0.83 di - efk-belk-elasticsearch-data-0

I changed the Elasticsearch service name and upgraded the chart. Since this name is an environment variable in the master pods, the upgrade deleted the master pods to create new ones. Because they form a StatefulSet, the elected master (master-2) was deleted first.
After that, master-0 was elected as the master.

192.168.1.3 7 98 10 1.71 0.79 0.64 i - efk-belk-elasticsearch-client-5f8455d6c5-wqggx
192.168.1.13 mi - efk-belk-elasticsearch-master-2
192.168.1.221 12 97 7 0.19 0.19 0.49 mi - efk-belk-elasticsearch-master-1
192.168.1.30 9 89 7 0.29 0.39 0.67 mi * efk-belk-elasticsearch-master-0
192.168.1.59 di - efk-belk-elasticsearch-data-0

Then master-1 was deleted, and finally master-0 was deleted at the end of the upgrade. This results in the following exception when accessing the cluster:

{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

Since then, the cluster has been completely unresponsive; even though all the master pods have come back up, the cluster has not re-formed.

I see the following error logs in the master-1 pod:

"logger":"o.e.c.c.Coordinator","timezone":"UTC","marker":"[efk-belk-elasticsearch-master-1] ","log":"failed to validate incoming join request from node [{efk-belk-elasticsearch-client-5f8455d6c5-wqggx}{RNeNMC3HTZ2tGDFxf4OrQQ}{o_s3tSR5RveFYhomgAj_UQ}{192.168.1.3}{192.168.1.3:9300}]"}
org.elasticsearch.transport.RemoteTransportException: [efk-belk-elasticsearch-client-5f8455d6c5-wqggx][192.168.1.3:9300][internal:cluster/coordination/join/validate]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid TAiCOKecSwawqLc1WPOniQ than local cluster uuid slo-5iTvQm6N1cAHz4Vt2w, rejecting
at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$4(JoinHelper.java:147) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1077) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.0.1.jar:7.0.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]

And this error in the master-2 pod:

{"type":"log","host":"efk-belk-elasticsearch-master-2","level":"WARN","systemid":"a6aab87097d14ff5b192d56f1d73ff1a","system":"BELK","time": "2019-07-04T00:47:58.853Z","logger":"o.e.c.c.ClusterFormationFailureHelper","timezone":"UTC","marker":"[efk-belk-elasticsearch-master-2] ","log":"master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [efk-belk-elasticsearch-master-0, efk-belk-elasticsearch-master-1, efk-belk-elasticsearch-master-2] to bootstrap a cluster: have discovered ; discovery will continue using [10.254.70.48:9300] from hosts providers and [{efk-belk-elasticsearch-master-2}{jkxPUt86RW-wOP0UcfC-IQ}{awau0gi6Sgm1WHYsOQhvgQ}{192.168.1.18}{192.168.1.18:9300}] from last-known cluster state; node term 0, last-accepted version 0 in term 0"}

Can you tell me what is causing this issue and how I can resolve it?
I also frequently see cluster-joining issues with the data pods when pods are deleted and recreated.

Your master nodes formed a new cluster with a different cluster UUID when they restarted. The cluster UUID is immutable and stored in the data path, so this probably means the data path for the master nodes was wiped during the restart.

How can I prevent the data path from being wiped on restart?

Is this cluster state data stored inside the master pod? Do you mean I need to persist the contents of /data in the master pods using persistent volumes (e.g. Cinder)?

Are you referring to the files at /data/data/nodes/0/_state/?
-rw-rw-r--. 1 elasticsearch elasticsearch 893 Jul 4 00:38 global-1.st
-rw-rw-r--. 1 elasticsearch elasticsearch 228 Jul 4 06:40 manifest-11.st
-rw-rw-r--. 1 elasticsearch elasticsearch 71 Jul 4 06:40 node-2.st

It's important that everything under the path given by the path.data setting remains in place across restarts of master-eligible nodes (as well as for data nodes). In your case that looks like /data/data. From the docs:

Master nodes must have access to the data/ directory (just like data nodes) as this is where the cluster state is persisted between node restarts.
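
A minimal sketch of what that looks like in a Kubernetes StatefulSet, assuming the chart mounts /data into the Elasticsearch container; the volume name, storage class, and size below are placeholders, not values from this chart:

# Fragment of the master StatefulSet spec (illustrative values only).
spec:
  template:
    spec:
      containers:
        - name: elasticsearch
          volumeMounts:
            - name: data
              mountPath: /data             # must cover path.data (/data/data here)
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: cinder           # assumption: a cinder-backed StorageClass exists
        resources:
          requests:
            storage: 4Gi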


I am a bit confused by these two statements from the docs:

Master nodes must have access to the data/ directory (just like data nodes) as this is where the cluster state is persisted between node restarts.

and

Never run different node types (i.e. master, data) from the same data directory. This can lead to unexpected data loss.

Should the master and data pods access the same persistent volume, mounted in both at path.data?
Or does this mean the contents of the path.data directory in both the data and master pods should be persisted, but in separate PVs?

Each node should have its own data directory.

The "never run different node types from the same data directory" statement is in the section about the deprecated node.max_local_storage_nodes setting. Do not use this setting, and then you need not worry about this section.

@DavidTurner Thanks for the response.
I was able to deploy a cluster without any issues after persisting the master pods' data directory. Just a question: what size of persistent storage should be provisioned for a master pod? Generally, how much storage space is needed for the cluster state data stored on a master?

It depends enormously on how you are using the cluster, but dedicated master nodes typically need much less space than data nodes. The safest thing to do is to measure it for your use case, size the volumes appropriately, and monitor the disk usage to make sure it doesn't grow too much over time.
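
For example, a conservative starting point for the master volumeClaimTemplate might look something like the following; this is purely illustrative, so measure the actual size of path.data on your masters (e.g. with du -sh) and monitor it rather than relying on a fixed number:

# Illustrative request only; dedicated-master data directories hold cluster
# state (index metadata, mappings, templates), which is normally small
# compared to what data nodes need.
resources:
  requests:
    storage: 4Gi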

Thanks @DavidTurner for the help.

