Elasticsearch unable to discover master nodes

Hi all,

We were running an HA Elasticsearch cluster with 3 nodes. This morning, without my being aware of it, Kubernetes was upgraded and the nodes were restarted, which left the cluster in an inconsistent state. This is the error we get:

{"type": "server", "timestamp": "2020-06-10T13:05:29,995Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-ha-cluster", "node.name": "es-ha-cluster-es-nodes-0", "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [pdgmDYn8SZil8PC54Niqjw, jEEaIPPERBGCoZcuBH3mWw, 2OBhIOXrRfq0pvPVwpv-TA], have discovered [{es-ha-cluster-es-nodes-0}{2OBhIOXrRfq0pvPVwpv-TA}{RebLHrjjT4KUOCLe0Eu7dw}{10.244.3.23}{10.244.3.23:9300}{dilm}{ml.machine_memory=30064771072, xpack.installed=true, ml.max_open_jobs=20}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 10.244.1.21:9300, 10.244.4.26:9300] from hosts providers and [{es-ha-cluster-es-nodes-0}{2OBhIOXrRfq0pvPVwpv-TA}{RebLHrjjT4KUOCLe0Eu7dw}{10.244.3.23}{10.244.3.23:9300}{dilm}{ml.machine_memory=30064771072, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 15, last-accepted version 12945 in term 15" }
{"type": "server", "timestamp": "2020-06-10T13:05:36,608Z", "level": "ERROR", "component": "o.e.x.s.a.e.NativeUsersStore", "cluster.name": "es-ha-cluster", "node.name": "es-ha-cluster-es-nodes-0", "message": "security index is unavailable. short circuiting retrieval of user [aks-ingest]" }

The cluster has been in this state since this morning. I have been trying to remove nodes by following the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/add-elasticsearch-nodes.html

But the API is not working, and I get the following error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

Can anyone please help me with this?

Thanks!

  • What is the .yml configuration of the 3 nodes?
  • What does the localhost:9200/_cluster/health API return?

This error can happen if too many nodes go down and no new master can be elected. With 3 master-eligible nodes, the election needs at least 2 of them, as your log message says. If all but 1 node go down, the remaining node cannot be elected master on its own, and therefore you get master_not_discovered_exception.
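
To make the quorum arithmetic concrete, here is a minimal Python sketch. It assumes the usual majority rule (more than half of the voting configuration) and uses the node IDs copied from the log message above. The function names quorum_size and can_elect_master are made up for this illustration and are not part of any Elasticsearch API.

```python
def quorum_size(voting_nodes):
    """Smallest strict majority of a voting configuration of the given size."""
    return voting_nodes // 2 + 1


def can_elect_master(voting_ids, discovered_ids):
    """True if the discovered nodes include a majority of the voting configuration."""
    return len(voting_ids & discovered_ids) >= quorum_size(len(voting_ids))


# IDs copied from the "master not discovered or elected yet" log message.
voting_ids = {
    "pdgmDYn8SZil8PC54Niqjw",
    "jEEaIPPERBGCoZcuBH3mWw",
    "2OBhIOXrRfq0pvPVwpv-TA",
}
# es-ha-cluster-es-nodes-0 has only discovered itself.
discovered_ids = {"2OBhIOXrRfq0pvPVwpv-TA"}

print(quorum_size(len(voting_ids)))                  # 2
print(can_elect_master(voting_ids, discovered_ids))  # False
```

So the node in the log stays without a master until it can reach at least one of the other two node IDs in that list.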

---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es-ha-cluster
spec:
  version: 7.6.2
  image: elasticsearch:7.6.2
  http:
    tls:
      selfSignedCertificate:
        disabled: true
  nodeSets:
  - name: nodes
    count: 3
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
    podTemplate:
      spec:
        containers:
          - name: elasticsearch
            env:
              - name: ES_JAVA_OPTS
                value: "-Xms10g -Xmx10g"
            resources:
              limits:
                cpu: "4"
                memory: 28Gi
              requests:
                cpu: "2"
                memory: 20Gi
            volumeMounts:
              - name: elasticsearch-data
                mountPath: /usr/share/elasticsearch/data
        initContainers:
          - name: chown-data-volumes
            command: ["sh", "-c", "chown elasticsearch:elasticsearch /usr/share/elasticsearch/data"]
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 256Gi
        storageClassName: eck-storage

Health state:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}