Elasticsearch master election is taking too long (30+ mins)

Thanks @DavidTurner,

Will do a test right now. But how can I bring data to my data nodes so that I can test this in our usual usage scenario, rather than with empty data nodes? Because with empty data nodes it works...

Br,
Zoltan

It's a bad idea to restore a cluster by copying or cloning data at the filesystem level as you seem to be doing. You should use snapshot and restore to import indices into an empty cluster instead.
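A minimal sketch of that flow (the repository name my_backup, the mount point /mnt/es_backups, and the index name my-index are placeholders for your own values):

```shell
# 1. On the old cluster: register a shared-filesystem snapshot repository
#    and take a snapshot of the indices you want to keep.
curl -X PUT "localhost:9200/_snapshot/my_backup" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mnt/es_backups"}}'

curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"

# 2. On the new, empty cluster (with the same repository registered,
#    ideally read-only): restore the indices from the snapshot.
curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "my-index"}'
```

The repository location must be listed in path.repo in elasticsearch.yml on every node before the repository can be registered.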

Hi @DavidTurner,

I managed to start the cluster with a workaround, which I will detail below.

It looks like, after all the previous re-deploys of the same cluster, the data path of each elastic-master node looked like this:

root@test-pod-m1:/data/old_m1# cd nodes/

root@test-pod-m1:/data/old_m1/nodes# ls -latr
total 20
drwxr-xr-x 4 1000       1000 4096 Jun 24 12:39 ..
drwxrwxr-x 4 1000 4294967294 4096 Oct 14 12:51 0
drwxrwxr-x 4 1000 4294967294 4096 Oct 14 12:51 1
drwxrwxr-x 4 1000 4294967294 4096 Oct 14 12:51 2
drwxrwxr-x 5 1000 4294967294 4096 Oct 25 10:18 .

=> containing folders for all 3 possible node UUIDs of the cluster

root@test-pod-m1:/data/old_m1/nodes/0/_state# ls -latr
total 48
-rw-rw-r-- 1 1000 4294967294    71 Oct 14 12:47 node-291.st
drwxrwxr-x 4 1000 4294967294  4096 Oct 14 12:51 ..
-rw-rw-r-- 1 1000 4294967294 23456 Oct 14 13:02 global-176.st
-rw-rw-r-- 1 1000 4294967294  6863 Oct 23 00:00 manifest-74201.st
drwxrwxr-x 2 1000 4294967294  4096 Oct 25 10:19 .

root@test-pod-m1:/data/old_m1/nodes/0/_state# cat node-291.st
?▒lstate:)
▒node_idUAfaPrC0YSVSTPsN7CM5pOQ▒▒(▒▒▒+▒


root@test-pod-m1:/data/old_m1/nodes/1/_state# ls -latr
total 48
-rw-rw-r-- 1 1000 4294967294    71 Oct 14 12:47 node-119.st
drwxrwxr-x 4 1000 4294967294  4096 Oct 14 12:51 ..
-rw-rw-r-- 1 1000 4294967294 23462 Oct 14 13:02 global-175.st
-rw-rw-r-- 1 1000 4294967294  6864 Oct 23 00:00 manifest-218378.st
drwxrwxr-x 2 1000 4294967294  4096 Oct 23 00:00 .

root@test-pod-m1:/data/old_m1/nodes/1/_state# cat node-119.st
?▒lstate:)
▒node_idUmOXnX6QwQDOeRDqsf4b3dg▒▒(▒▒▒▒


root@test-pod-m1:/data/old_m1/nodes/2/_state# ls -latr
total 48
-rw-rw-r-- 1 1000 4294967294    71 Oct 14 12:47 node-55.st
drwxrwxr-x 4 1000 4294967294  4096 Oct 14 12:51 ..
-rw-rw-r-- 1 1000 4294967294 23443 Oct 14 13:02 global-145.st
-rw-rw-r-- 1 1000 4294967294  6862 Oct 23 00:00 manifest-7568.st
drwxrwxr-x 2 1000 4294967294  4096 Oct 23 00:00 .

root@test-pod-m1:/data/old_m1/nodes/2/_state# cat node-55.st
?▒lstate:)
▒node_idUHgRBrqoTQqq_sUTFHbbYYw▒▒(▒▒U▒ 

=> At startup on the old storage, the cluster somehow knew which of the 3 folders (and thus which of the 3 UUIDs) each elastic-master node should use. But on the new storage, after I copied the data over, the cluster no longer knew which folder to choose: as shown in my previous reply, it picked folder 0 on all 3 elastic-master nodes, so all of them ended up with the same UUID "AfaPrC0YSVSTPsN7CM5pOQ"...
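For anyone checking the same thing: the node id inside each binary node-*.st file can be extracted without cat-ing the raw bytes, since the id is stored as a printable string inside the binary state file (a sketch, assuming the nodes/*/_state layout shown above):

```shell
# Print the node id recorded in each nodes/N/_state/node-*.st file.
# The .st files are binary (SMILE-encoded), but the node id itself is a
# plain printable string, so grep -a (treat binary as text) can pull it out.
for d in nodes/*/_state; do
  [ -d "$d" ] || continue                          # skip if no such folders
  printf '%s -> ' "$d"
  grep -aho 'node_id.[A-Za-z0-9_-]*' "$d"/node-*.st || echo '(not found)'
done
```

Each master's own UUID should appear in exactly one of the three folders.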

Workaround: I deleted all the folders from the data path, created a folder 0 on each node, and copied in the data for the specific UUID belonging to that node:

[root@elasticsearch-master-0 data]# cd nodes/
[root@elasticsearch-master-0 nodes]# ls -latr
total 12
drwxrwsr-x 5 root elasticsearch 4096 Oct 25 09:04 ..
drwxrwsr-x 3 root elasticsearch 4096 Oct 25 11:29 .
drwxrwsr-x 4 root elasticsearch 4096 Oct 25 11:36 0
[root@elasticsearch-master-0 nodes]# cat 0/_state/node-292.st
?▒lstate:)
▒node_idUAfaPrC0YSVSTPsN7CM5pOQ▒▒(▒▒▒+▒


[root@elasticsearch-master-1 data]# cd nodes/
[root@elasticsearch-master-1 nodes]# ls -latr
total 12
drwxrwsr-x 5 root elasticsearch 4096 Oct 25 09:04 ..
drwxrwsr-x 3 root elasticsearch 4096 Oct 25 11:30 .
drwxrwsr-x 4 root elasticsearch 4096 Oct 25 11:36 0
[root@elasticsearch-master-1 nodes]# cat 0/_state/node-120.st
?▒lstate:)
▒node_idUmOXnX6QwQDOeRDqsf4b3dg▒▒(▒▒▒▒


[root@elasticsearch-master-2 data]# cd nodes/
[root@elasticsearch-master-2 nodes]# ls -latr
total 12
drwxrwsr-x 5 root elasticsearch 4096 Oct 25 09:04 ..
drwxrwsr-x 3 root elasticsearch 4096 Oct 25 11:30 .
drwxrwsr-x 4 root elasticsearch 4096 Oct 25 11:36 0
[root@elasticsearch-master-2 nodes]# cat 0/_state/node-56.st
?▒lstate:)
▒node_idUHgRBrqoTQqq_sUTFHbbYYw▒▒(▒▒U▒

I re-deployed the cluster; the election was successful and the readiness/liveness probes were OK after a total of 5 minutes, which is totally acceptable for us.
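For reference, the reshuffle on each master boils down to the following. This is a demonstration with throwaway directories created via mktemp (all paths illustrative); on the real pods, OLD was the copied old-cluster data and NEW was the master's own data path, with N being the folder whose node-*.st holds that master's own UUID (0 on master-0, 1 on master-1, 2 on master-2):

```shell
# Set up throwaway directories mimicking the real layout (master-1 here).
WORK=$(mktemp -d)
OLD="$WORK/old_m1"; NEW="$WORK/data"; N=1
mkdir -p "$OLD/nodes/0/_state" "$OLD/nodes/1/_state" "$OLD/nodes/2/_state" "$NEW/nodes"
printf 'node_idUmOXnX6QwQDOeRDqsf4b3dg' > "$OLD/nodes/1/_state/node-119.st"

# The actual reshuffle, per master:
rm -rf "$NEW/nodes"                    # drop the ambiguous folders 0,1,2
mkdir -p "$NEW/nodes"                  # recreate an empty nodes/ directory
cp -a "$OLD/nodes/$N" "$NEW/nodes/0"   # keep only this master's folder, renamed to 0

ls "$NEW/nodes"                        # only folder 0 remains
```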

Please let me know whether, from your side, it is OK to keep this cluster setup in use with the workaround described above.

Thank you,
Gergely Zoltan

:grimacing: No, I don't really recommend doing anything like this. You should be starting up nodes with empty data directories and restoring data from snapshots. You shouldn't be cloning nodes like this. There's a real risk that these cloned nodes might discover each other at some point and this will result in serious data loss.

You should also not be using the deprecated setting nodes.max_local_storage_nodes, which is a further source of confusion here. It's much simpler to give each node its own data path instead.
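For example, a per-node layout can look like this (paths illustrative):

```yaml
# elasticsearch.yml: one distinct data path per node, instead of several
# nodes sharing one path via the deprecated nodes.max_local_storage_nodes.

# on elasticsearch-master-0
path.data: /data/elasticsearch-master-0

# on elasticsearch-master-1
path.data: /data/elasticsearch-master-1
```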


Hi @DavidTurner,

Thanks for the assistance.
In our case the old cluster (and its data) doesn't exist anymore, so it is safe to run this as it is.

I have created another topic for the ELK resource requirements:


Can you please take a look at it when you have some time?

Thank you,
Zoltan