Nodes fail to join cluster after full cluster restart (cluster uuid mismatch?)

I was forced to stop all nodes in our cluster, and now I can't bring the cluster back up. It looks like there is a cluster UUID mismatch, but I don't know why this happened or how to fix it.

"type": "server", "timestamp": "2019-11-18T10:46:02,609Z", "level": "WARN", "component": "o.e.c.c.Coordinator", "cluster.name": "docker-cluster", "node.name": "node-002", "message": "failed to validate incoming join request from node [{node-012}{GrpvmVyVSOm2UpZQIUa3pg}{guMNX7HRT0q8-Lx7HL0Cnw}{10.33.9.82}{10.33.9.82:9300}{dil}{ml.machine_memory=67388260352, ml.max_open_jobs=20, xpack.installed=true}]", "cluster.uuid": "fQo4028sSN-QWcCaG2w_ZA", "node.id": "v9st6CCkQyioc6YFMbj3Mg" , 
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [node-012][172.19.0.2:9300][internal:cluster/coordination/join/validate]",
"Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid fQo4028sSN-QWcCaG2w_ZA than local cluster uuid i44zLmaER4ipQYj-F9QVDw, rejecting",

The data folder on each node is unchanged. What can I do to bring up the cluster again?

The cluster UUID is stored on disk on all master-eligible nodes and on all data nodes, and it must match so that nodes cannot accidentally join a different cluster, since that is a good way to lose data. The usual way to hit this exception is to be using ephemeral storage on the master-eligible nodes. If the cluster UUID is missing on the master-eligible nodes then they will invent a new one, but this indicates that they have lost the rest of the cluster metadata too, which means the data on your data nodes can no longer be read correctly. If so, the safest way to proceed is to fix the storage on the master-eligible nodes so it persists across restarts and then restore your data from a recent snapshot.
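For reference, each running node reports the cluster UUID it currently belongs to on its root endpoint, so you can compare what the nodes think they are part of (a minimal sketch; the host names are placeholders for your own nodes):

```
# Ask two nodes for their view of the cluster and compare the "cluster_uuid" fields.
# (Add -u <user>:<password> if security is enabled.)
curl -s 'http://node-002:9200/?pretty' | grep cluster_uuid
curl -s 'http://node-012:9200/?pretty' | grep cluster_uuid
```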

We’re using persistent storage on all nodes including the master-eligible nodes.

Is a full restore my only option? This is our production cluster and a full restore would take too long.

Maybe also worth noting: I’ve done a successful full cluster restart previously.

What exact version are you using?

Can you grep all your logs on all nodes for INFO messages containing the string "cluster UUID", going back as far as possible, and share those logs here (or on https://gist.github.com if they don't fit here)?
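Something along these lines should do it, adjusted to wherever your nodes actually write their logs (the paths and container name below are just examples):

```
# On each host: pull every line mentioning the cluster UUID out of the
# Elasticsearch logs, including rotated ones.
grep -i "cluster uuid" /var/log/elasticsearch/*.log

# Or, if the containers log to stdout:
docker logs my-es-container 2>&1 | grep -i "cluster uuid"
```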

OK, so your initial guess was right: The data folder was missing.

During the time that the cluster was offline, a cron job (that I was unaware of, running docker system prune --volumes --force) removed the Docker volume containing /usr/share/elasticsearch/data on four of our nodes, including the eligible master nodes.

We're looking into the possibility of restoring /var/lib/docker (and hopefully the volumes) from backup. Would this be a bad idea? Will this leave our cluster in an inconsistent state? Four of the nodes would be using old data folders.

Yep that'd do it.

It is risky to try and restore from a filesystem backup and I can't recommend it in good conscience, since it will take some of your nodes "back in time" and the effects of this are undefined. It may result in lost data (possibly silently) or may render some of your indices unreadable.

We have nightly snapshots of our data. Is there a guide on how to restore a full cluster (including security data) from scratch from a snapshot?

I don't know of anything more specific than the restore docs. You may need to disable some components (e.g. Kibana, monitoring, watcher, rollups, ...) for the duration of the restore since they may otherwise create indices that block the restore, and you'll need to use a security realm other than native since the native realm uses the .security index that you'll be restoring.
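If it helps, the restore itself can be a single request once the snapshot repository is registered (a rough sketch only; the repository and snapshot names are placeholders, and any existing indices with the same names must be closed or deleted first):

```
# Restore every index from the snapshot, plus the global cluster state
# (templates, persistent settings). Repository/snapshot names are placeholders.
curl -X POST 'localhost:9200/_snapshot/nightly-backups/snapshot-2019-11-17/_restore?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "indices": "*",
  "include_global_state": true
}'
```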

Thanks!

In which path are this UUID and the cluster metadata stored? In the data directory of the node?
If not, that would mean a full cluster restart would always fail in a Kubernetes environment, because the filesystem of a pod is always ephemeral (unless it is mounted from volumes)...

Yes, it lives in the node's data directory (path.data), so in a Kubernetes environment the data path of the master-eligible and data nodes needs to be on a persistent volume.
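If you want to see it on disk: with the default Docker image and a 7.x node, the metadata sits under the data path, roughly like this (just a sketch of the layout):

```
# The cluster metadata, including the cluster UUID, is kept under the node's data path:
ls /usr/share/elasticsearch/data/nodes/0/_state
```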
