Data node using different cluster id

Shambu_Pujar · August 14, 2020, 10:35am

Hi

As part of a maintenance exercise, I had to bring down the master. In the process, I made the data node a master node as well. Now, with the new master, I cannot get the data node to join it.

The data node throws the following error:

        {"type": "server", "timestamp": "2020-08-14T10:25:51,473Z", "level": "INFO", "component": "o.e.c.c.JoinHelper", "cluster.name": "my-es", "node.name": "es-data-0", "message": "failed to join {es-master-0}{8LMhokWuSyKkPdWS3WrBRg}{SGtF0N1lQNCZoMgPcAp96g}{192.168.7.130}{192.168.7.130:9300}{lm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{C96McUhCSJ-53IT_kbs-NA}{192.168.8.135}{192.168.8.135:9300}{dil}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional.empty}",
    "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-master-0][192.168.7.130:9300][internal:cluster/coordination/join]",
    "Caused by: java.lang.IllegalStateException: failure when sending a validation request to node",
    "at org.elasticsearch.cluster.coordination.Coordinator$2.onFailure(Coordinator.java:514) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1130) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:244) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]",
    "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]",
    "at java.lang.Thread.run(Thread.java:830) [?:?]",
    "Caused by: org.elasticsearch.transport.RemoteTransportException: [es-data-0][192.168.8.135:9300][internal:cluster/coordination/join/validate]",
    "Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid uWqbnLXRRxOAB55anI5fwQ than local cluster uuid 28FsH0JRQ1-XDmSPZCPYDg, rejecting",
    "at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$4(JoinHelper.java:148) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]",
    "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:306) ~[?:?]",
    "at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:378) ~[?:?]",
    "at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:186) ~[?:?]",
    "at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:130) ~[?:?]",
    "at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:248) ~[?:?]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:310) ~[?:?]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:321) ~[?:?]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:245) ~[?:?]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:196) ~[?:?]",
    "at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:139) ~[?:?]",
    "at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:121) ~[?:?]",
    "at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:313) ~[?:?]",
    "at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.2.jar:7.6.2]",
    "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]",
    "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]",
    "at java.lang.Thread.run(Thread.java:830) ~[?:?]"] }

On the master I get this error:

{"type": "server", "timestamp": "2020-08-14T10:20:53,860Z", "level": "WARN", "component": "o.e.c.c.Coordinator", "cluster.name": "my-es", "node.name": "es-master-0", "message": "failed to validate incoming join request from node [{es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{C96McUhCSJ-53IT_kbs-NA}{192.168.8.135}{192.168.8.135:9300}{dil}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}]", "cluster.uuid": "uWqbnLXRRxOAB55anI5fwQ", "node.id": "8LMhokWuSyKkPdWS3WrBRg" ,
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-data-0][192.168.8.135:9300][internal:cluster/coordination/join/validate]",
"Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid uWqbnLXRRxOAB55anI5fwQ than local cluster uuid 28FsH0JRQ1-XDmSPZCPYDg, rejecting",
"at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$4(JoinHelper.java:148) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:306) ~[?:?]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:378) ~[?:?]",
"at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:186) ~[?:?]",
"at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:130) ~[?:?]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:248) ~[?:?]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:310) ~[?:?]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:321) ~[?:?]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:245) ~[?:?]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:196) ~[?:?]",
"at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:139) ~[?:?]",
"at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:121) ~[?:?]",
"at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:313) ~[?:?]",
"at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]"] }

I understand that the master's cluster ID is different from the one the data node is trying to use. The data node holds data that I do not want to lose. Is there any way I can fix this?
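For readers who want to pull the two cluster UUIDs out of a log like the ones above, here is a minimal Python sketch. The regular expression is based only on the message wording shown in this thread (Elasticsearch 7.6.2); the function name `extract_cluster_uuids` is made up for illustration and is not part of any Elasticsearch tooling.

```python
import re

# Pattern based on the rejection message quoted in the logs above.
# Assumption: the wording matches the 7.6.2 message; other versions may differ.
UUID_MISMATCH = re.compile(
    r"different cluster uuid (?P<remote>[\w-]+) "
    r"than local cluster uuid (?P<local>[\w-]+)"
)


def extract_cluster_uuids(log_line):
    """Return (remote_uuid, local_uuid) if the line reports a mismatch, else None."""
    match = UUID_MISMATCH.search(log_line)
    if match is None:
        return None
    return match.group("remote"), match.group("local")


# The "Caused by" line copied from the data node's stack trace.
line = (
    "Caused by: org.elasticsearch.cluster.coordination."
    "CoordinationStateRejectedException: join validation on cluster state "
    "with a different cluster uuid uWqbnLXRRxOAB55anI5fwQ "
    "than local cluster uuid 28FsH0JRQ1-XDmSPZCPYDg, rejecting"
)

remote_uuid, local_uuid = extract_cluster_uuids(line)
print("master's cluster uuid:   ", remote_uuid)
print("data node's cluster uuid:", local_uuid)
```

Here the first UUID belongs to the cluster state sent by the new master, and the second is the one stored on the data node's disk.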

I am running this setup in a Kubernetes cluster.

Thanks in advance

DavidTurner · August 14, 2020, 11:19am

It looks like you've brought up a completely new master node. You can recover your data by running the original master node instead.

Shambu_Pujar · August 14, 2020, 11:30am

I do not have the original master node or its storage, as it got wiped out. I preserved the storage on the data node only; the master node's storage was wiped. How can I bring up the original master node?

DavidTurner · August 14, 2020, 11:37am

See these docs, particularly:

... if you shut down half or more of the master-eligible nodes all at the same time then the cluster will normally become unavailable. If this happens then you can bring the cluster back online by starting the removed nodes again.

You had two master-eligible nodes, and shut down one of them, which is "half or more", so this paragraph applies. The only safe way to proceed is to start the lost node again. If you no longer have this node then this cluster is dead, you'll need to start a new cluster and restore the data from a recent snapshot.

The lost node held the vitally-important cluster metadata, without which the data on your data node is meaningless.
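The "half or more" rule can be illustrated with a little arithmetic. This is a simplified sketch that assumes a plain strict-majority model of the voting configuration; the function names are hypothetical, and it ignores how Elasticsearch adjusts the voting configuration automatically as nodes join and leave.

```python
def majority(voting_nodes):
    """Smallest node count that is strictly more than half of the voters."""
    return voting_nodes // 2 + 1


def survives_loss(voting_nodes, lost):
    """True if the remaining voters can still form a majority."""
    return voting_nodes - lost >= majority(voting_nodes)


# Two master-eligible nodes: losing one leaves 1 of 2, which is not a majority.
print(survives_loss(voting_nodes=2, lost=1))

# Three master-eligible nodes: losing one leaves 2 of 3, which still is.
print(survives_loss(voting_nodes=3, lost=1))
```

Under this model a two-node voting configuration cannot tolerate the loss of either node, which is the situation described above.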

Shambu_Pujar · August 14, 2020, 12:47pm

I managed to bring up the other nodes. But since I added one of them to the voting exclusions (es-master-1) and another got wiped out (es-master-0), I can't get a master elected. Below are the logs from the three containers.
es-data-0

{
    "type": "server",
    "timestamp": "2020-08-14T12:40:32,160Z",
    "level": "WARN",
    "component": "o.e.c.c.ClusterFormationFailureHelper",
    "cluster.name": "my-es",
    "node.name": "es-data-0",
    "message": "master not discovered or elected yet, an election requires a node with id [Is0eCOtyQR2Pxtn_7dTKnA], have discovered [{es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{8O1znwgESr2v_JlevdDB-A}{192.168.14.75}{192.168.14.75:9300}{dilm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}, {es-master-1}{ebVAspo0Skuakonh5IE44A}{lR8W5qR3TsWXyzQiBNdudg}{192.168.28.84}{192.168.28.84:9300}{lm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}, {es-master-2}{egfJOXSFRUKbTrZUmZE3GQ}{0yTNtoeaTwqI0qhhxpdnMA}{192.168.43.174}{192.168.43.174:9300}{lm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}] which is not a quorum; discovery will continue using [10.100.115.186:9300] from hosts providers and [{es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{8O1znwgESr2v_JlevdDB-A}{192.168.14.75}{192.168.14.75:9300}{dilm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 31, last-accepted version 17115 in term 31"
  }

es-master-2 (the node that has the cluster information)

{
    "type": "server",
    "timestamp": "2020-08-14T12:40:08,488Z",
    "level": "WARN",
    "component": "o.e.c.c.ClusterFormationFailureHelper",
    "cluster.name": "my-es",
    "node.name": "es-master-2",
    "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [Is0eCOtyQR2Pxtn_7dTKnA, ebVAspo0Skuakonh5IE44A, egfJOXSFRUKbTrZUmZE3GQ], have discovered [{es-master-2}{egfJOXSFRUKbTrZUmZE3GQ}{0yTNtoeaTwqI0qhhxpdnMA}{192.168.43.174}{192.168.43.174:9300}{lm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}, {es-master-1}{ebVAspo0Skuakonh5IE44A}{lR8W5qR3TsWXyzQiBNdudg}{192.168.28.84}{192.168.28.84:9300}{lm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}, {es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{8O1znwgESr2v_JlevdDB-A}{192.168.14.75}{192.168.14.75:9300}{dilm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [10.100.115.186:9300] from hosts providers and [{es-master-2}{egfJOXSFRUKbTrZUmZE3GQ}{0yTNtoeaTwqI0qhhxpdnMA}{192.168.43.174}{192.168.43.174:9300}{lm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 15, last-accepted version 16710 in term 14"
  }

es-master-1

{
    "type": "server",
    "timestamp": "2020-08-14T12:40:44,258Z",
    "level": "WARN",
    "component": "o.e.c.c.ClusterFormationFailureHelper",
    "cluster.name": "my-es",
    "node.name": "es-master-1",
    "message": "master not discovered or elected yet, an election requires a node with id [Is0eCOtyQR2Pxtn_7dTKnA], have discovered [{es-master-1}{ebVAspo0Skuakonh5IE44A}{lR8W5qR3TsWXyzQiBNdudg}{192.168.28.84}{192.168.28.84:9300}{lm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}, {es-master-2}{egfJOXSFRUKbTrZUmZE3GQ}{0yTNtoeaTwqI0qhhxpdnMA}{192.168.43.174}{192.168.43.174:9300}{lm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}, {es-data-0}{GUp73vUlTm6cgyKYTatD4Q}{8O1znwgESr2v_JlevdDB-A}{192.168.14.75}{192.168.14.75:9300}{dilm}{ml.machine_memory=16818073600, ml.max_open_jobs=20, xpack.installed=true}] which is not a quorum; discovery will continue using [10.100.115.186:9300] from hosts providers and [{es-master-1}{ebVAspo0Skuakonh5IE44A}{lR8W5qR3TsWXyzQiBNdudg}{192.168.28.84}{192.168.28.84:9300}{lm}{ml.machine_memory=16818073600, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 23, last-accepted version 16852 in term 23"
  }

Is there a way I can remove the voting exclusion that I put on es-master-1 when I was reducing the number of master nodes?

DavidTurner · August 15, 2020, 8:24am

No, as I said before, the only way to resurrect this cluster is to bring back the missing node. You need to satisfy all of these constraints:

an election requires a node with id [Is0eCOtyQR2Pxtn_7dTKnA]
an election requires at least 2 nodes with ids from [Is0eCOtyQR2Pxtn_7dTKnA, ebVAspo0Skuakonh5IE44A, egfJOXSFRUKbTrZUmZE3GQ]
an election requires a node with id [Is0eCOtyQR2Pxtn_7dTKnA]

Since the node with ID Is0eCOtyQR2Pxtn_7dTKnA is missing, you are stuck, since that node was the only one that held your cluster metadata. Without the cluster metadata, the data on the other nodes is meaningless. If you can't bring that node back, you'll need to start a new cluster and restore the data from a recent snapshot.
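To make the three constraints concrete, here is a small Python sketch that checks them against the node IDs the logs report as discovered. The `Constraint` class is a hypothetical model written for this illustration, not how Elasticsearch represents voting configurations internally; the node IDs are copied from the log messages above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    required: int          # how many of the candidate nodes must be present
    candidates: frozenset  # node IDs that count towards this constraint

    def satisfied_by(self, discovered):
        return len(self.candidates & discovered) >= self.required


MISSING = "Is0eCOtyQR2Pxtn_7dTKnA"  # the wiped node
MASTER_1 = "ebVAspo0Skuakonh5IE44A"
MASTER_2 = "egfJOXSFRUKbTrZUmZE3GQ"
DATA_0 = "GUp73vUlTm6cgyKYTatD4Q"

constraints = [
    Constraint(1, frozenset({MISSING})),                      # from es-data-0
    Constraint(2, frozenset({MISSING, MASTER_1, MASTER_2})),  # from es-master-2
    Constraint(1, frozenset({MISSING})),                      # from es-master-1
]

# Nodes that every log line lists under "have discovered".
discovered = {DATA_0, MASTER_1, MASTER_2}

results = [c.satisfied_by(discovered) for c in constraints]
print(results)       # one entry per constraint
print(all(results))  # an election needs every constraint to hold
```

Only the second constraint holds, which matches es-master-2 reporting "which is a quorum" while the other two nodes report "which is not a quorum". Adding the missing node ID to `discovered` is the only change that makes all three hold.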

system · September 12, 2020, 8:25am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
