Data node using different cluster id

Hi

As part of a maintenance exercise, I had to bring down the master. In that process, I also made the data node a master. Now, with the new master, I am not able to get the data node to join it.

The data node throws the following error:

        {"type": "server", "timestamp": "2020-08-14T10:25:51,473Z", "level": "INFO", "component": "o.e.c.c.JoinHelper", "cluster.name": "my-es", "node.name": "es-data-0", "message": "failed to join {es-master-0}{8LMhokWuSyKkPdWS3WrBRg}{SGtF0N1lQNCZoMgPcAp96g}{192.168.7.130}{192.168.7.130:9300}{lm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{C96McUhCSJ-53IT_kbs-NA}{192.168.8.135}{192.168.8.135:9300}{dil}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional.empty}",
    "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-master-0][192.168.7.130:9300][internal:cluster/coordination/join]",
    "Caused by: java.lang.IllegalStateException: failure when sending a validation request to node",
    "at org.elasticsearch.cluster.coordination.Coordinator$2.onFailure(Coordinator.java:514) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1130) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:244) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]",
    "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]",
    "at java.lang.Thread.run(Thread.java:830) [?:?]",
    "Caused by: org.elasticsearch.transport.RemoteTransportException: [es-data-0][192.168.8.135:9300][internal:cluster/coordination/join/validate]",
    "Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid uWqbnLXRRxOAB55anI5fwQ than local cluster uuid 28FsH0JRQ1-XDmSPZCPYDg, rejecting",
    "at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$4(JoinHelper.java:148) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]",
    "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:306) ~[?:?]",
    "at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:378) ~[?:?]",
    "at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:186) ~[?:?]",
    "at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:130) ~[?:?]",
    "at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:248) ~[?:?]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:310) ~[?:?]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:321) ~[?:?]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:245) ~[?:?]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:196) ~[?:?]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:139) ~[?:?]",
    "at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:121) ~[?:?]",
    "at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:313) ~[?:?]",
    "at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]",
    "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]",
    "at java.lang.Thread.run(Thread.java:830) ~[?:?]"] }

On the master, I get this error:

{"type": "server", "timestamp": "2020-08-14T10:20:53,860Z", "level": "WARN", "component": "o.e.c.c.Coordinator", "cluster.name": "my-es", "node.name": "es-master-0", "message": "failed to validate incoming join request from node [{es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{C96McUhCSJ-53IT_kbs-NA}{192.168.8.135}{192.168.8.135:9300}{dil}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}]", "cluster.uuid": "uWqbnLXRRxOAB55anI5fwQ", "node.id": "8LMhokWuSyKkPdWS3WrBRg" ,
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-data-0][192.168.8.135:9300][internal:cluster/coordination/join/validate]",
"Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid uWqbnLXRRxOAB55anI5fwQ than local cluster uuid 28FsH0JRQ1-XDmSPZCPYDg, rejecting",
"at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$4(JoinHelper.java:148) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:306) ~[?:?]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:378) ~[?:?]",
"at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:186) ~[?:?]",
"at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:130) ~[?:?]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:248) ~[?:?]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:310) ~[?:?]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:321) ~[?:?]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:245) ~[?:?]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:196) ~[?:?]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:139) ~[?:?]",
"at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:121) ~[?:?]",
"at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:313) ~[?:?]",
"at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]"] }

I understand the cluster UUID seems to be different from the one the data node is trying to use. I have data on the data node which I do not want to lose. Is there any way I can fix this?

I am running this setup in a Kubernetes cluster.
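
For reference, the two UUIDs in the rejection can be cross-checked against what the new master reports; a minimal check in console syntax, assuming the default HTTP port 9200 is reachable (authentication omitted):

    # Run against the new master (es-master-0): UUID of the cluster it has formed,
    # which should match the "different cluster uuid" in the rejection
    GET /
    # -> "cluster_uuid" : "uWqbnLXRRxOAB55anI5fwQ"

    # The "local cluster uuid 28FsH0JRQ1-XDmSPZCPYDg" in the stack trace is the UUID
    # persisted in the data node's data path from the original cluster.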

Thanks in advance

It looks like you've brought up a completely new master node. You can recover your data by running the original master node instead.

I do not have the original master node any more: its storage got wiped out. I have only preserved the storage on the data node. How can I bring up the original master node?

See these docs, particularly:

... if you shut down half or more of the master-eligible nodes all at the same time then the cluster will normally become unavailable. If this happens then you can bring the cluster back online by starting the removed nodes again.

You had two master-eligible nodes and shut down one of them, which is "half or more", so this paragraph applies. The only safe way to proceed is to start the lost node again. If you no longer have this node, then this cluster is dead; you'll need to start a new cluster and restore the data from a recent snapshot.

The lost node held the vitally important cluster metadata, without which the data on your data node is meaningless.
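
If it comes to that, restoring into a fresh cluster would look roughly like this; a sketch in console syntax, assuming a shared-filesystem repository and using hypothetical names (my_fs_repo, snapshot_1):

    # Register the repository that holds the snapshots
    # (the location must be listed under path.repo on every node)
    PUT /_snapshot/my_fs_repo
    {
      "type": "fs",
      "settings": { "location": "/mnt/es_backups" }
    }

    # List the snapshots available in the repository
    GET /_snapshot/my_fs_repo/_all

    # Restore all indices from the chosen snapshot into the new cluster
    POST /_snapshot/my_fs_repo/snapshot_1/_restore
    {
      "indices": "*",
      "include_global_state": false
    }

This only helps if snapshots were being taken before the failure; a repository cannot be created retroactively for data that was never snapshotted.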

I managed to bring up the other nodes. But since I had added one of them to the voting exclusions (es-master-1) and one of them got wiped out (es-master-0), I feel I can't get a master elected. Below are the logs from the three containers.
es-data-0

{
    "type": "server",
    "timestamp": "2020-08-14T12:40:32,160Z",
    "level": "WARN",
    "component": "o.e.c.c.ClusterFormationFailureHelper",
    "cluster.name": "my-es",
    "node.name": "es-data-0",
    "message": "master not discovered or elected yet, an election requires a node with id [Is0eCOtyQR2Pxtn_7dTKnA], have discovered [{es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{8O1znwgESr2v_JlevdDB-A}{192.168.14.75}{192.168.14.75:9300}{dilm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}, {es-master-1}{ebVAspo0Skuakonh5IE44A}{lR8W5qR3TsWXyzQiBNdudg}{192.168.28.84}{192.168.28.84:9300}{lm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}, {es-master-2}{egfJOXSFRUKbTrZUmZE3GQ}{0yTNtoeaTwqI0qhhxpdnMA}{192.168.43.174}{192.168.43.174:9300}{lm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}] which is not a quorum; discovery will continue using [10.100.115.186:9300] from hosts providers and [{es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{8O1znwgESr2v_JlevdDB-A}{192.168.14.75}{192.168.14.75:9300}{dilm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 31, last-accepted version 17115 in term 31"
  }

es-master-2 (which is the node that has the cluster information)

{
    "type": "server",
    "timestamp": "2020-08-14T12:40:08,488Z",
    "level": "WARN",
    "component": "o.e.c.c.ClusterFormationFailureHelper",
    "cluster.name": "my-es",
    "node.name": "es-master-2",
    "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [Is0eCOtyQR2Pxtn_7dTKnA, ebVAspo0Skuakonh5IE44A, egfJOXSFRUKbTrZUmZE3GQ], have discovered [{es-master-2}{egfJOXSFRUKbTrZUmZE3GQ}{0yTNtoeaTwqI0qhhxpdnMA}{192.168.43.174}{192.168.43.174:9300}{lm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}, {es-master-1}{ebVAspo0Skuakonh5IE44A}{lR8W5qR3TsWXyzQiBNdudg}{192.168.28.84}{192.168.28.84:9300}{lm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}, {es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{8O1znwgESr2v_JlevdDB-A}{192.168.14.75}{192.168.14.75:9300}{dilm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [10.100.115.186:9300] from hosts providers and [{es-master-2}{egfJOXSFRUKbTrZUmZE3GQ}{0yTNtoeaTwqI0qhhxpdnMA}{192.168.43.174}{192.168.43.174:9300}{lm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 15, last-accepted version 16710 in term 14"
  }

es-master-1

{
    "type": "server",
    "timestamp": "2020-08-14T12:40:44,258Z",
    "level": "WARN",
    "component": "o.e.c.c.ClusterFormationFailureHelper",
    "cluster.name": "my-es",
    "node.name": "es-master-1",
    "message": "master not discovered or elected yet, an election requires a node with id [Is0eCOtyQR2Pxtn_7dTKnA], have discovered [{es-master-1}{ebVAspo0Skuakonh5IE44A}{lR8W5qR3TsWXyzQiBNdudg}{192.168.28.84}{192.168.28.84:9300}{lm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}, {es-master-2}{egfJOXSFRUKbTrZUmZE3GQ}{0yTNtoeaTwqI0qhhxpdnMA}{192.168.43.174}{192.168.43.174:9300}{lm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}, {es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{8O1znwgESr2v_JlevdDB-A}{192.168.14.75}{192.168.14.75:9300}{dilm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}] which is not a quorum; discovery will continue using [10.100.115.186:9300] from hosts providers and [{es-master-1}{ebVAspo0Skuakonh5IE44A}{lR8W5qR3TsWXyzQiBNdudg}{192.168.28.84}{192.168.28.84:9300}{lm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 23, last-accepted version 16852 in term 23"
  }

Is there a way I can remove the voting exclusion that I had put on es-master-1 when I was reducing the number of master nodes?
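
For reference, the exclusion was added with something along these lines (the pre-7.8 form of the voting exclusions API, with my node name):

    # Excludes es-master-1 from the voting configuration
    # (roughly the call used while scaling down the master nodes)
    POST /_cluster/voting_config_exclusions/es-master-1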

No, as I said before, the only way to resurrect this cluster is to bring back the missing node. You need to satisfy all of these constraints:

an election requires a node with id [Is0eCOtyQR2Pxtn_7dTKnA]
an election requires at least 2 nodes with ids from [Is0eCOtyQR2Pxtn_7dTKnA, ebVAspo0Skuakonh5IE44A, egfJOXSFRUKbTrZUmZE3GQ]
an election requires a node with id [Is0eCOtyQR2Pxtn_7dTKnA]

Since the node with ID Is0eCOtyQR2Pxtn_7dTKnA is missing, you are stuck: that node was the only one that held your cluster metadata, and without it the data on the other nodes is meaningless. If you can't bring that node back, you'll need to start a new cluster and restore the data from a recent snapshot.
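
For completeness: voting exclusions are normally cleared with the call below, but it has to go through an elected master, so it cannot break this deadlock (a sketch assuming default settings):

    # Clears all voting configuration exclusions.
    # Requires an elected master, which is exactly what is missing here.
    DELETE /_cluster/voting_config_exclusions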
