After restarting the master node, data and client nodes cannot discover the master

Hi, I am running an Elasticsearch cluster (8.7.0) on Kubernetes with 1 master, 1 client, and 3 data nodes.
After restarting my master node, the other nodes cannot discover the master again.

This is in the log of the data node:

{"@timestamp":"2023-05-31T12:38:36.905Z", "log.level": "WARN", "message":"master not discovered yet: have discovered [{elasticsearch-data}{XLhtbNSQQOG4F6a-luPJ7Q}{QNGqb2xxTzagpLE6BPQhYA}{elasticsearch-data}{192.168.217.75}{192.168.217.75:9300}{d}{8.7.0}, {elasticsearch-master}{9hvRUjvsTXeVF-NIEwbZQA}{2hbkEtdwTIy7DBAYI1AxOw}{elasticsearch-master}{192.168.247.28}{192.168.247.28:9300}{m}{8.7.0}]; discovery will continue using [10.102.87.156:9300] from hosts providers and [{elasticsearch-master}{Jwgz0LUATzyyQk4qvU292g}{OSmdhhayQguwVZwmj3c3fw}{elasticsearch-master}{192.168.84.131}{192.168.84.131:9300}{m}{8.7.0}] from last-known cluster state; node term 1, last-accepted version 71 in term 1; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.7/discovery-troubleshooting.html", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-data][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.cluster.uuid":"hKpt2rS0TbG1z2PjWRtVnQ","elasticsearch.node.id":"XLhtbNSQQOG4F6a-luPJ7Q","elasticsearch.node.name":"elasticsearch-data","elasticsearch.cluster.name":"elasticsearch"}
{"@timestamp":"2023-05-31T12:38:45.163Z", "log.level": "WARN", "message":"monitoring execution failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-data][write][T#6]","log.logger":"org.elasticsearch.xpack.monitoring.MonitoringService","elasticsearch.cluster.uuid":"hKpt2rS0TbG1z2PjWRtVnQ","elasticsearch.node.id":"XLhtbNSQQOG4F6a-luPJ7Q","elasticsearch.node.name":"elasticsearch-data","elasticsearch.cluster.name":"elasticsearch","error.type":"org.elasticsearch.xpack.monitoring.exporter.ExportException","error.message":"failed to flush export bulks","error.stack_trace":"org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks\n\tat org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:110)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:175)\n\tat org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$1(LocalBulk.java:114)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:175)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:38)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.client.internal.node.NodeClient$SafelyWrappedActionListener.onFailure(NodeClient.java:170)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.tasks.TaskManager$1.onFailure(TaskManager.java:218)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:38)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:97)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:97)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener$RunBeforeActionListener.onFailure(ActionListener.java:450)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:92)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.retry(TransportBulkAction.java:685)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:672)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:541)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:891)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1589)\nCaused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulk [default_local]\n\t... 
20 more\nCaused by: org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:177)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:668)\n\t... 8 more\n"}
{"@timestamp":"2023-05-31T12:38:45.423Z", "log.level": "WARN", "message":"failed to connect to {elasticsearch-master}{Jwgz0LUATzyyQk4qvU292g}{OSmdhhayQguwVZwmj3c3fw}{elasticsearch-master}{192.168.84.131}{192.168.84.131:9300}{m}{8.7.0}{xpack.installed=true} (tried [67] times)", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-data][generic][T#3]","log.logger":"org.elasticsearch.cluster.NodeConnectionsService","elasticsearch.cluster.uuid":"hKpt2rS0TbG1z2PjWRtVnQ","elasticsearch.node.id":"XLhtbNSQQOG4F6a-luPJ7Q","elasticsearch.node.name":"elasticsearch-data","elasticsearch.cluster.name":"elasticsearch","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[elasticsearch-master][192.168.84.131:9300] connect_exception","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [elasticsearch-master][192.168.84.131:9300] connect_exception\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1151)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:502)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:111)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:100)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:149)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:139)\n\tat org.elasticsearch.transport.netty4@8.7.0/org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:62)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)\n\tat org.elasticsearch.security@8.7.0/org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport$ClientSslHandlerInitializer.lambda$connect$1(SecurityNetty4Transport.java:267)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)\n\tat 
io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)\n\tat io.netty.transport@4.1.86.Final/io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:262)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)\n\tat io.netty.transport@4.1.86.Final/io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)\n\tat io.netty.common@4.1.86.Final/io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.common@4.1.86.Final/io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat java.base/java.lang.Thread.run(Thread.java:1589)\nCaused by: org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:72)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$notifyListenerDirectly$0(ListenableFuture.java:111)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:499)\n\t... 30 more\nCaused by: java.util.concurrent.ExecutionException: io.netty.channel.ConnectTimeoutException: connection timed out: 192.168.84.131/192.168.84.131:9300\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:231)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:53)\n\tat org.elasticsearch.server@8.7.0/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:65)\n\t... 32 more\nCaused by: io.netty.channel.ConnectTimeoutException: connection timed out: 192.168.84.131/192.168.84.131:9300\n\tat io.netty.transport@4.1.86.Final/io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261)\n\t... 9 more\n"}

This is in the log of the master node:

{"@timestamp":"2023-05-31T12:52:25.306Z", "log.level": "WARN", "message":"address [10.102.87.156:9300], node [null], requesting [false] discovery result: [elasticsearch-master][192.168.247.28:9300] successfully discovered local node {elasticsearch-master}{9hvRUjvsTXeVF-NIEwbZQA}{2hbkEtdwTIy7DBAYI1AxOw}{elasticsearch-master}{192.168.247.28}{192.168.247.28:9300}{m}{8.7.0} at [10.102.87.156:9300]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-master][generic][T#2]","log.logger":"org.elasticsearch.discovery.PeerFinder","elasticsearch.node.name":"elasticsearch-master","elasticsearch.cluster.name":"elasticsearch"}

This is in the log of the client node:

{"@timestamp":"2023-05-31T12:53:07.147Z", "log.level": "WARN", "message":"master not discovered yet: have discovered [{elasticsearch-client}{PSXY08nNT1S8KWVH6qKBaA}{OAPLtfaUQXmTN0TriZl-7A}{elasticsearch-client}{192.168.84.134}{192.168.84.134:9300}{8.7.0}, {elasticsearch-master}{9hvRUjvsTXeVF-NIEwbZQA}{2hbkEtdwTIy7DBAYI1AxOw}{elasticsearch-master}{192.168.247.28}{192.168.247.28:9300}{m}{8.7.0}]; discovery will continue using [10.102.87.156:9300] from hosts providers and [{elasticsearch-master}{Jwgz0LUATzyyQk4qvU292g}{OSmdhhayQguwVZwmj3c3fw}{elasticsearch-master}{192.168.84.131}{192.168.84.131:9300}{m}{8.7.0}] from last-known cluster state; node term 1, last-accepted version 71 in term 1; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.7/discovery-troubleshooting.html", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-client][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.cluster.uuid":"hKpt2rS0TbG1z2PjWRtVnQ","elasticsearch.node.id":"PSXY08nNT1S8KWVH6qKBaA","elasticsearch.node.name":"elasticsearch-client","elasticsearch.cluster.name":"elasticsearch"}

Can you please help me? I have been trying to solve this for 3 days. What could be the problem, why is this happening, and how can I solve it?

Are both master and data nodes backed by persistent storage?

Yes, they are.

This is in the deployment of my master node:

volumeMounts:
...
        - mountPath: /data
          name: elasticsearch-master-pvc
...

volumes:
...
      - name: elasticsearch-master-pvc
        persistentVolumeClaim:
          claimName: elasticsearch-master-pvc
...

And this is in the statefulset of my data nodes:

volumeMounts:
...
        - mountPath: /data/db
          name: elasticsearch-data
...

  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: elasticsearch-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      storageClassName: glusterfs-replication-none
      volumeMode: Filesystem

Is this okay?

This is also in the logs of the master node:

{"@timestamp":"2023-05-31T13:46:21.626Z", "log.level": "WARN", "message":"master not discovered yet, this node has not previously joined a bootstrapped cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{elasticsearch-master}{9hvRUjvsTXeVF-NIEwbZQA}{2hbkEtdwTIy7DBAYI1AxOw}{elasticsearch-master}{192.168.247.28}{192.168.247.28:9300}{m}{8.7.0}]; discovery will continue using [10.102.87.156:9300] from hosts providers and [{elasticsearch-master}{9hvRUjvsTXeVF-NIEwbZQA}{2hbkEtdwTIy7DBAYI1AxOw}{elasticsearch-master}{192.168.247.28}{192.168.247.28:9300}{m}{8.7.0}] from last-known cluster state; node term 0, last-accepted version 0 in term 0; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.7/discovery-troubleshooting.html", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-master][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"elasticsearch-master","elasticsearch.cluster.name":"elasticsearch"}

That means your master node's data path did not persist across its restart, which is fatal to the cluster. You will need to build your cluster again from scratch and restore any missing data from a recent snapshot.

See these docs for more information:

The contents of the path.data directory must persist across restarts, because this is where your data is stored.
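
For context (an assumption based on the standard image layout, not something visible in your manifests): when path.data is not set, Elasticsearch writes to the data directory under its home, which in the official image is /usr/share/elasticsearch/data. With mountPath: /data, that directory does not sit on the PVC, so the cluster state is lost whenever the pod is recreated. A minimal sketch of the setting that would keep it on the volume (the directory name under /data is only an example):

    # sketch: keep cluster state on the PVC mounted at /data
    path.data: /data/es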

I don't have path.data in my elasticsearch.yaml file. Could that be why my data path did not persist across the restart?
I only have a persistent volume, which is mounted in my deployment:

volumeMounts:
...
        - mountPath: /data
          name: elasticsearch-master-pvc
...

volumes:
...
      - name: elasticsearch-master-pvc
        persistentVolumeClaim:
          claimName: elasticsearch-master-pvc
...

And this is my elasticsearch.yaml file:

    cluster.name: ${CLUSTER_NAME}
    node.name: ${NODE_NAME}
    discovery.seed_hosts: ${NODE_LIST}
    network.host: 0.0.0.0
    node.roles: ["data"]
    xpack.monitoring.collection.enabled: true
    xpack.security.enabled: true
    xpack.security.transport.ssl.enabled: true
    xpack.security.transport.ssl.verification_mode: certificate
    xpack.security.transport.ssl.keystore.path: /usr/share/elasticsearch/config/certs/elastic-certificates.p12
    xpack.security.transport.ssl.truststore.path: /usr/share/elasticsearch/config/certs/elastic-certificates.p12
    xpack.security.http.ssl.enabled: false
    xpack.security.http.ssl.truststore.path: /usr/share/elasticsearch/config/certs/elastic-certificates.p12
    xpack.security.http.ssl.keystore.path: /usr/share/elasticsearch/config/certs/elastic-certificates.p12

Sounds like a plausible explanation indeed. Is Elasticsearch writing anything to that PVC? Definitely best to set path.data anyway.

And what should I set path.data to?

/var/lib/elasticsearch?

Anywhere that persists across restarts will do. You mentioned mountPath: /data above, which suggests that anything under /data will work.
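
The same applies to the data nodes: their statefulset mounts the claim at /data/db, so on those nodes the setting would need to point somewhere under that mount instead. A sketch (the directory name is only an example):

    # data-node elasticsearch.yaml (sketch) -- the statefulset mounts the PVC at /data/db
    path.data: /data/db/es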

Will /data/elasticsearch work?
