Fix broken cluster with elasticsearch-node detach-cluster not working

I want to fix a broken cluster (for example, when a node cannot join because it has a different cluster UUID) without deleting the whole data folder (the dirty solution I see suggested every time...). I am currently testing this with the official Elasticsearch Helm chart:

  1. I created a cluster with 3 master nodes.
  2. I deleted the Helm release (the pods are removed but the volumes stay) and changed the cluster name to break the cluster.
  3. When I re-created the Helm release, the cluster was broken, as expected.
  4. I removed all running pods by scaling to 0.
  5. I ran yes | elasticsearch-node detach-cluster; yes | elasticsearch-node remove-customs '*' on every volume.
  6. I brought all pods back up by scaling to 3.

Clustering should work now, but it does not:

{"type": "server", "timestamp": "2021-01-04T09:41:30,190Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "test2", "node.name": "es-test-master-2", "message": "master not discovered yet and this node was detached from its previous cluster, have discovered [{es-test-master-2}{-FGfHUJgRwGEYkXgjFwiGQ}{0QWnT4lIToKQY4_jx6rV-w}{10.233.116.171}{10.233.116.171:9300}{m}{xpack.installed=true, transform.node=false}, {es-test-master-0}{yD4GKy3JSUmV1NW2mcLAtw}{v5eS-gMiRi-OaC88GYcoRw}{10.233.82.166}{10.233.82.166:9300}{m}{xpack.installed=true, transform.node=false}, {es-test-master-1}{7JZ9qyhATPq5ZCejHEkx3g}{UjJLnb6gSR22bY6QZWZfLA}{10.233.110.17}{10.233.110.17:9300}{m}{xpack.installed=true, transform.node=false}]; discovery will continue using [10.233.110.17:9300, 10.233.82.166:9300] from hosts providers and [{es-test-master-2}{-FGfHUJgRwGEYkXgjFwiGQ}{0QWnT4lIToKQY4_jx6rV-w}{10.233.116.171}{10.233.116.171:9300}{m}{xpack.installed=true, transform.node=false}] from last-known cluster state; node term 0, last-accepted version 32 in term 0" }
{"type": "server", "timestamp": "2021-01-04T09:41:35,486Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "test2", "node.name": "es-test-master-2", "message": "path: /_cluster/health, params: {wait_for_status=green, timeout=1s}",
"stacktrace": ["org.elasticsearch.discovery.MasterNotDiscoveredException: null",
"at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:230) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) [elasticsearch-7.10.1.jar:7.10.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }

How can I fix it?

Removing the data folder is the only clean solution. You should do that.

Why does https://www.elastic.co/guide/en/elasticsearch/reference/current/node-tool.html not work as expected?

So with Elasticsearch, we must accept potentially losing data if we just change the cluster name?

No, that's not the case at all. Changing the cluster name risks no data loss, nor does it need you to run elasticsearch-node. (That's a tautology: elasticsearch-node always risks data loss)

Changing the cluster name is just an example of a way to break the cluster (it is what I did with the Helm chart in my step 2).

The code behind detach-cluster seems to reset the cluster state fine:

Oh, I have just seen MUST_JOIN_ELECTED_MASTER!

Changing the cluster name also doesn't break the cluster. You mention "node cannot join because different cluster UUID", that's nothing to do with the cluster name.

I don't think this means what you think it means. You really should not be using elasticsearch-node detach-cluster. I quote its output here:

You should only run this tool if you have permanently lost all of the
master-eligible nodes in this cluster and you cannot restore the cluster
from a snapshot, or you have already unsafely bootstrapped a new cluster
by running `elasticsearch-node unsafe-bootstrap` on a master-eligible
node that belonged to the same cluster as this node. This tool can cause
arbitrary data loss and its use should be your last resort.

Repeating: This tool can cause arbitrary data loss. If you are concerned about possible data loss then you should not be considering using this tool.

As I understand it, the cluster state is stored in files (as MySQL, MongoDB, etc. do), and what I have learned is that it is always possible to change a file :slightly_smiling_face:

What if I use EMPTY_CONFIG = new VotingConfiguration(Collections.emptySet()); instead of MUST_JOIN_ELECTED_MASTER? Would the cluster state then really be reset?

Yes, changing the cluster name makes clustering impossible:

{"type": "server", "timestamp": "2021-01-04T10:40:04,210Z", "level": "WARN", "component": "o.e.d.HandshakingTransportAddressConnector", "cluster.name": "test3", "node.name": "es-test-master-2", "message": "handshake failed for [connectToRemoteMasterNode[10.233.80.53:9300]]",
"stacktrace": ["java.lang.IllegalStateException: handshake with [{10.233.80.53:9300}{A3vTBniwSfSfO8WvWeQiiA}{es-test-master-headless}{10.233.80.53:9300}] failed: remote cluster name [test2] does not match local cluster name [test3]",
"at org.elasticsearch.transport.TransportService$5.onResponse(TransportService.java:471) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.transport.TransportService$5.onResponse(TransportService.java:466) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1171) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1171) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:253) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:247) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) [elasticsearch-7.10.1.jar:7.10.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }

Hence, if I want to change the cluster name with the Helm chart, I need to "reset" the former nodes (hence the pods, hence the volumes, hence the data folders).

That would either result in a broken cluster or else directly lead to data loss too, I'm not sure which.

Perhaps you should take a step back and describe what you're actually trying to do here. Renaming a cluster doesn't break it, but a cluster reporting UUID mismatches hasn't just been renamed and has likely already lost data. Protecting against data loss is the reason for checking that the cluster UUID matches, there is simply no way to bypass that check without risking data loss.

That's not the case, but you do have to use the same cluster name on all the nodes in the cluster!
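
The difference between a cluster name mismatch and a cluster UUID mismatch can be checked before reaching for any repair tool. The following is only a rough sketch: it assumes you have already fetched each node's `GET /` response (which reports `cluster_name` and `cluster_uuid`) and parsed it into a dict. The helper name `diagnose` and every value passed to it are hypothetical.

```python
from typing import Dict, List


def diagnose(nodes: List[Dict[str, str]]) -> List[str]:
    """Compare the cluster_name and cluster_uuid reported by each node.

    `nodes` holds one parsed `GET /` response per node. This hypothetical
    helper only inspects the data it is given; it changes nothing.
    """
    findings = []
    names = {n["cluster_name"] for n in nodes}
    # A node that has not joined a cluster yet typically reports "_na_",
    # so that placeholder is not counted as a distinct UUID.
    uuids = {n["cluster_uuid"] for n in nodes if n["cluster_uuid"] != "_na_"}
    if len(names) > 1:
        findings.append("cluster.name differs: " + ", ".join(sorted(names)))
    if len(uuids) > 1:
        findings.append("cluster UUID differs: " + ", ".join(sorted(uuids)))
    return findings


if __name__ == "__main__":
    # Made-up responses: same UUID everywhere, but one node was renamed.
    sample = [
        {"cluster_name": "test2", "cluster_uuid": "uuid-a"},
        {"cluster_name": "test3", "cluster_uuid": "uuid-a"},
    ]
    for line in diagnose(sample):
        print(line)
```

A name difference alone is a configuration problem that is fixed by setting the same `cluster.name` on every node. A UUID difference means the nodes belong to different clusters, which is the situation the safety check exists to catch.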

Sometimes we want to change the cluster name, or to recover from a split brain, etc.

Well, I managed to fix it by using elasticsearch-node unsafe-bootstrap on one volume. Here are the k8s jobs I ran, in case they help:

# Run once per volume for master-0 and master-1: "[0-1]" below is shorthand
# for two separate Jobs, one with 0 and one with 1 substituted in both places.
apiVersion: batch/v1
kind: Job
metadata:
  name: test-fix-cluster-m[0-1]
  namespace: dev-steam
spec:
  template:
    spec:
      containers:
      - args:
        - -c
        - yes | elasticsearch-node detach-cluster; yes | elasticsearch-node remove-customs '*'
        command:
        - /bin/sh
        image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
        name: elasticsearch
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: es-data
      restartPolicy: Never
      volumes:
      - name: es-data
        persistentVolumeClaim:
          claimName: es-test-master-es-test-master-[0-1]
---
# Run once against the volume of master-2.
apiVersion: batch/v1
kind: Job
metadata:
  name: test-fix-cluster-m2
  namespace: dev-steam
spec:
  template:
    spec:
      containers:
      - args:
        - -c
        - yes | elasticsearch-node unsafe-bootstrap -v
        command:
        - /bin/sh
        image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
        name: elasticsearch
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: es-data
      restartPolicy: Never
      volumes:
      - name: es-data
        persistentVolumeClaim:
          claimName: es-test-master-es-test-master-2

I don't know how to emphasise any more strongly that what you are doing is dangerous and will eventually result in data loss. The process you are describing doesn't safely recover from a split brain, in fact it bypasses the safety checks that are there to prevent data loss caused by split-brain. If you are finding that you need to do this then there is something very very wrong with how you are managing your Elasticsearch cluster.

Ultimately it's your data so it's your call, but I cannot overstate to future readers of this thread how important it is not to follow the same path unless they also have no regard for the integrity of their data.

To be honest, between the 2 solutions:

  • delete all data (100% chance of data loss)
  • a fix (X % chance of data loss)

I am pretty sure everyone will try the fix before removing data :stuck_out_tongue:

I understand that Elasticsearch is not a database, only a super indexer/searcher service, and that its data must always be backed up somewhere. But in some cases we use it as the primary data source (for log analysis), so we want something really reliable.

A correctly-orchestrated cluster needs neither of these solutions. If you are getting into situations where you end up needing to do anything like this then you're doing something wrong.

Absolutely. For instance, yesterday someone accidentally removed the "master" Helm release along with its PVCs and then re-created it, hence a different cluster UUID.

Errors can happen, from humans or machines; that is why software provides disaster recovery tools and documentation (like the awesome elasticsearch-node binary).

Also, I can see there is an API to import dangling indices (I guess I am not the first guy to do something wrong :stuck_out_tongue: ):

So I don't have any data loss! Et voilà.

The fact that you are required to pass ?accept_data_loss=true to that API should tell you that this is completely false.
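
To make the point about `accept_data_loss` concrete, here is a small sketch of how an import request for one dangling index is formed. The path `POST /_dangling/<index-uuid>?accept_data_loss=true` is the documented import endpoint; the helper name `dangling_import_request`, the base URL and the index UUID are placeholders chosen for illustration. The helper builds the request only and sends nothing.

```python
from typing import Tuple
from urllib.parse import quote, urlencode


def dangling_import_request(
    base_url: str, index_uuid: str, accept_data_loss: bool
) -> Tuple[str, str]:
    """Return (method, url) for importing one dangling index.

    Hypothetical helper: it refuses to build the request unless the caller
    explicitly acknowledges the data-loss risk, mirroring the flag that the
    API itself requires.
    """
    if not accept_data_loss:
        raise ValueError("importing a dangling index requires accept_data_loss=true")
    query = urlencode({"accept_data_loss": "true"})
    path = "/_dangling/" + quote(index_uuid, safe="")
    return "POST", base_url.rstrip("/") + path + "?" + query


if __name__ == "__main__":
    # Placeholder host and UUID; replace with values from GET /_dangling.
    print(dangling_import_request("http://localhost:9200", "some-index-uuid", True))
```

The explicit flag is the design point: the request cannot be issued by accident, because the caller has to state that possible data loss is acceptable.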

Yes, things go wrong, that's why you must set up snapshots.
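
As a minimal sketch of what "set up snapshots" can look like, the snippet below builds the JSON bodies for registering a repository (`PUT /_snapshot/<repository>`) and for a snapshot lifecycle policy (`PUT /_slm/policy/<policy-id>`). The function names, the repository name, the path, the schedule and the retention period are all assumptions for illustration; only the two endpoints and the field names come from the Elasticsearch APIs.

```python
import json


def fs_repository_body(location: str) -> dict:
    """Body for PUT /_snapshot/<repository>: a shared-filesystem repository.

    `location` must be a path listed under `path.repo` on every node.
    """
    return {"type": "fs", "settings": {"location": location}}


def slm_policy_body(repository: str) -> dict:
    """Body for PUT /_slm/policy/<policy-id>: one snapshot per night.

    Schedule, snapshot name pattern and retention are example values.
    """
    return {
        "schedule": "0 30 1 * * ?",        # cron expression: 01:30 every day
        "name": "<nightly-snap-{now/d}>",  # date-math snapshot name
        "repository": repository,
        "config": {"indices": ["*"]},      # snapshot all indices
        "retention": {"expire_after": "30d"},
    }


if __name__ == "__main__":
    # Placeholder path and repository name.
    print(json.dumps(fs_repository_body("/mnt/es-backups"), indent=2))
    print(json.dumps(slm_policy_body("my_backup_repo"), indent=2))
```

On Kubernetes an object-store repository type is probably more practical than `fs`; `fs` is used here only because it needs no plugin. Whatever the repository, a restore from a recent snapshot is the recovery path that does not depend on elasticsearch-node.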

Dangerous tools like elasticsearch-node and the dangling indices API exist for cases where things go so badly wrong that data loss is inevitable. They are not awesome at all, they are a last resort for when things are desperately broken. They don't really "fix" anything and if you use them as a matter of course then you will eventually lose data as a result.
