After restarting the master node, data and client nodes cannot discover the master

Hi, I am running an Elasticsearch cluster (8.7.0) on Kubernetes with 1 master, 1 client, and 3 data nodes.
After restarting my master node, the other nodes cannot discover the master again.

This is in the log of the data node:

{"@timestamp":"2023-05-31T12:38:36.905Z", "log.level": "WARN", "message":"master not discovered yet: have discovered [{elasticsearch-data}{XLhtbNSQQOG4F6a-luPJ7Q}{QNGqb2xxTzagpLE6BPQhYA}{elasticsearch-data}{192.168.217.75}{192.168.217.75:9300}{d}{8.7.0}, {elasticsearch-master}{9hvRUjvsTXeVF-NIEwbZQA}{2hbkEtdwTIy7DBAYI1AxOw}{elasticsearch-master}{192.168.247.28}{192.168.247.28:9300}{m}{8.7.0}]; discovery will continue using [10.102.87.156:9300] from hosts providers and [{elasticsearch-master}{Jwgz0LUATzyyQk4qvU292g}{OSmdhhayQguwVZwmj3c3fw}{elasticsearch-master}{192.168.84.131}{192.168.84.131:9300}{m}{8.7.0}] from last-known cluster state; node term 1, last-accepted version 71 in term 1; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.7/discovery-troubleshooting.html", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-data][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.cluster.uuid":"hKpt2rS0TbG1z2PjWRtVnQ","elasticsearch.node.id":"XLhtbNSQQOG4F6a-luPJ7Q","elasticsearch.node.name":"elasticsearch-data","elasticsearch.cluster.name":"elasticsearch"}
{"@timestamp":"2023-05-31T12:38:45.163Z", "log.level": "WARN", "message":"monitoring execution failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-data][write][T#6]","log.logger":"org.elasticsearch.xpack.monitoring.MonitoringService","elasticsearch.cluster.uuid":"hKpt2rS0TbG1z2PjWRtVnQ","elasticsearch.node.id":"XLhtbNSQQOG4F6a-luPJ7Q","elasticsearch.node.name":"elasticsearch-data","elasticsearch.cluster.name":"elasticsearch","error.type":"org.elasticsearch.xpack.monitoring.exporter.ExportException","error.message":"failed to flush export bulks","error.stack_trace":"org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks\n\tat org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:110)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:175)\n\tat org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$1(LocalBulk.java:114)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:175)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:38)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.client.internal.node.NodeClient$SafelyWrappedActionListener.onFailure(NodeClient.java:170)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.tasks.TaskManager$1.onFailure(TaskManager.java:218)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:38)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:97)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:97)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener$RunBeforeActionListener.onFailure(ActionListener.java:450)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:92)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.retry(TransportBulkAction.java:685)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:672)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:541)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:891)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1589)\nCaused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulk [default_local]\n\t... 
20 more\nCaused by: org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:177)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:668)\n\t... 8 more\n"}
{"@timestamp":"2023-05-31T12:38:45.423Z", "log.level": "WARN", "message":"failed to connect to {elasticsearch-master}{Jwgz0LUATzyyQk4qvU292g}{OSmdhhayQguwVZwmj3c3fw}{elasticsearch-master}{192.168.84.131}{192.168.84.131:9300}{m}{8.7.0}{xpack.installed=true} (tried [67] times)", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-data][generic][T#3]","log.logger":"org.elasticsearch.cluster.NodeConnectionsService","elasticsearch.cluster.uuid":"hKpt2rS0TbG1z2PjWRtVnQ","elasticsearch.node.id":"XLhtbNSQQOG4F6a-luPJ7Q","elasticsearch.node.name":"elasticsearch-data","elasticsearch.cluster.name":"elasticsearch","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[elasticsearch-master][192.168.84.131:9300] connect_exception","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [elasticsearch-master][192.168.84.131:9300] connect_exception\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1151)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:502)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:111)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:100)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:149)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:139)\n\tat org.elasticsearch.transport.netty4@8.7.0/org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:62)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)\n\tat org.elasticsearch.security@8.7.0/org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport$ClientSslHandlerInitializer.lambda$connect$1(SecurityNetty4Transport.java:267)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)\n\tat 
io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)\n\tat io.netty.transport@4.1.86.Final/io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:262)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)\n\tat io.netty.transport@4.1.86.Final/io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.common@4.1.86.Final/io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat java.base/java.lang.Thread.run(Thread.java:1589)\nCaused by: org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:72)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$notifyListenerDirectly$0(ListenableFuture.java:111)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:499)\n\t... 30 more\nCaused by: java.util.concurrent.ExecutionException: io.netty.channel.ConnectTimeoutException: connection timed out: 192.168.84.131/192.168.84.131:9300\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:231)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:53)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:65)\n\t... 32 more\nCaused by: io.netty.channel.ConnectTimeoutException: connection timed out: 192.168.84.131/192.168.84.131:9300\n\tat io.netty.transport@4.1.86.Final/io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261)\n\t... 9 more\n"}

This is in the log of the master node:

{"@timestamp":"2023-05-31T12:52:25.306Z", "log.level": "WARN", "message":"address [10.102.87.156:9300], node [null], requesting [false] discovery result: [elasticsearch-master][192.168.247.28:9300] successfully discovered local node {elasticsearch-master}{9hvRUjvsTXeVF-NIEwbZQA}{2hbkEtdwTIy7DBAYI1AxOw}{elasticsearch-master}{192.168.247.28}{192.168.247.28:9300}{m}{8.7.0} at [10.102.87.156:9300]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-master][generic][T#2]","log.logger":"org.elasticsearch.discovery.PeerFinder","elasticsearch.node.name":"elasticsearch-master","elasticsearch.cluster.name":"elasticsearch"}

This is in the log of the client node:

{"@timestamp":"2023-05-31T12:53:07.147Z", "log.level": "WARN", "message":"master not discovered yet: have discovered [{elasticsearch-client}{PSXY08nNT1S8KWVH6qKBaA}{OAPLtfaUQXmTN0TriZl-7A}{elasticsearch-client}{192.168.84.134}{192.168.84.134:9300}{8.7.0}, {elasticsearch-master}{9hvRUjvsTXeVF-NIEwbZQA}{2hbkEtdwTIy7DBAYI1AxOw}{elasticsearch-master}{192.168.247.28}{192.168.247.28:9300}{m}{8.7.0}]; discovery will continue using [10.102.87.156:9300] from hosts providers and [{elasticsearch-master}{Jwgz0LUATzyyQk4qvU292g}{OSmdhhayQguwVZwmj3c3fw}{elasticsearch-master}{192.168.84.131}{192.168.84.131:9300}{m}{8.7.0}] from last-known cluster state; node term 1, last-accepted version 71 in term 1; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.7/discovery-troubleshooting.html", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-client][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.cluster.uuid":"hKpt2rS0TbG1z2PjWRtVnQ","elasticsearch.node.id":"PSXY08nNT1S8KWVH6qKBaA","elasticsearch.node.name":"elasticsearch-client","elasticsearch.cluster.name":"elasticsearch"}

Can you please help me? I have been trying to solve this for 3 days. What could be the problem, why is this happening, and how can I solve it?

Are both master and data nodes backed by persistent storage?

Yes, they are.

This is in the deployment of my master node:

volumeMounts:
...
        - mountPath: /data
          name: elasticsearch-master-pvc
...

volumes:
...
      - name: elasticsearch-master-pvc
        persistentVolumeClaim:
          claimName: elasticsearch-master-pvc
...

And this is in the statefulset of my data nodes:

volumeMounts:
...
        - mountPath: /data/db
          name: elasticsearch-data
...

  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: elasticsearch-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      storageClassName: glusterfs-replication-none
      volumeMode: Filesystem

Is this okay?

This is also in the logs of the master node:

{"@timestamp":"2023-05-31T13:46:21.626Z", "log.level": "WARN", "message":"master not discovered yet, this node has not previously joined a bootstrapped cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{elasticsearch-master}{9hvRUjvsTXeVF-NIEwbZQA}{2hbkEtdwTIy7DBAYI1AxOw}{elasticsearch-master}{192.168.247.28}{192.168.247.28:9300}{m}{8.7.0}]; discovery will continue using [10.102.87.156:9300] from hosts providers and [{elasticsearch-master}{9hvRUjvsTXeVF-NIEwbZQA}{2hbkEtdwTIy7DBAYI1AxOw}{elasticsearch-master}{192.168.247.28}{192.168.247.28:9300}{m}{8.7.0}] from last-known cluster state; node term 0, last-accepted version 0 in term 0; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.7/discovery-troubleshooting.html", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-master][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"elasticsearch-master","elasticsearch.cluster.name":"elasticsearch"}

That means your master node's data path did not persist across its restart, which is fatal to the cluster. You will need to build your cluster again from scratch and restore any missing data from a recent snapshot.

See these docs for more information:

The contents of the path.data directory must persist across restarts, because this is where your data is stored.
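
For context (an assumption based on the standard image layout, not something visible in your manifests): when path.data is not set, Elasticsearch writes to the data directory under its home, which in the official image is /usr/share/elasticsearch/data. With mountPath: /data, that directory does not sit on the PVC, so the cluster state is lost whenever the pod is recreated. A minimal sketch of the setting that would keep it on the volume (the directory name under /data is only an example):

    # sketch: keep cluster state on the PVC mounted at /data
    path.data: /data/es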

I don't have path.data in my elasticsearch.yaml file. Could that be why my data path did not persist across the restart?
I only have a persistent volume, which is mounted in my deployment:

volumeMounts:
...
        - mountPath: /data
          name: elasticsearch-master-pvc
...

volumes:
...
      - name: elasticsearch-master-pvc
        persistentVolumeClaim:
          claimName: elasticsearch-master-pvc
...

And this is my elasticsearch.yaml file:

    cluster.name: ${CLUSTER_NAME}
    node.name: ${NODE_NAME}
    discovery.seed_hosts: ${NODE_LIST}
    network.host: 0.0.0.0
    node.roles: ["data"]
    xpack.monitoring.collection.enabled: true
    xpack.security.enabled: true
    xpack.security.transport.ssl.enabled: true
    xpack.security.transport.ssl.verification_mode: certificate
    xpack.security.transport.ssl.keystore.path: /usr/share/elasticsearch/config/certs/elastic-certificates.p12
    xpack.security.transport.ssl.truststore.path: /usr/share/elasticsearch/config/certs/elastic-certificates.p12
    xpack.security.http.ssl.enabled: false
    xpack.security.http.ssl.truststore.path: /usr/share/elasticsearch/config/certs/elastic-certificates.p12
    xpack.security.http.ssl.keystore.path: /usr/share/elasticsearch/config/certs/elastic-certificates.p12

Sounds like a plausible explanation indeed. Is Elasticsearch writing anything to that PVC? Definitely best to set path.data anyway.

And what should I set path.data to?

/var/lib/elasticsearch?

Anywhere that persists across restarts will do. You mentioned mountPath: /data above, which suggests that anything under /data will work.
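
The same applies to the data nodes: their statefulset mounts the claim at /data/db, so on those nodes the setting would need to point somewhere under that mount instead. A sketch (the directory name is only an example):

    # data-node elasticsearch.yaml (sketch) -- the statefulset mounts the PVC at /data/db
    path.data: /data/db/es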

Will /data/elasticsearch work?
