Getting “master not discovered or elected yet” and the cluster won't come up in version 7.9.1

Hi, I am building a 5-node cluster on k8s: 3 master nodes and 2 data nodes. When the 3 masters were restarted one after another, I got the error "master not discovered or elected yet" and the cluster stopped working. This situation lasted several hours, until I deleted all the master pods and rebuilt them.

{"type": "server", "timestamp": "2020-09-23T23:39:13,686Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "ems-search-000", "node.name": "ems-search-000-master-0", "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [_q-piBceRT2bpdM4yFUxfw, CGCD0ENwT1O_KV8snCwlCg, 3D2zfFbRSqeRJivqWm7TSA], have discovered [{ems-search-000-master-0}{CGCD0ENwT1O_KV8snCwlCg}{RMoCPsq0QGy6rC2dWD44Yw}{192.168.76.3}{192.168.76.3:9300}{m}{xpack.installed=true, transform.node=false}, {ems-search-000-master-2}{D4i0wEiYQl2LGKzvfB47lQ}{BZAhYx1DRcSSLBEcnd2gqQ}{192.168.69.91}{192.168.69.91:9300}{m}{xpack.installed=true, transform.node=false}, {ems-search-000-master-1}{i02Las-eRnCAwHrR0Ej1_A}{j5XoXcCNR9iZm7EQ6IegBQ}{192.168.91.203}{192.168.91.203:9300}{m}{xpack.installed=true, transform.node=false}] which is not a quorum; discovery will continue using [192.168.69.91:9300] from hosts providers and [{ems-search-000-master-2}{3D2zfFbRSqeRJivqWm7TSA}{nYRzL6S_SqSmn634Q-dv9A}{192.168.82.66}{192.168.82.66:9300}{m}{xpack.installed=true, transform.node=false}, {ems-search-000-master-0}{CGCD0ENwT1O_KV8snCwlCg}{RMoCPsq0QGy6rC2dWD44Yw}{192.168.76.3}{192.168.76.3:9300}{m}{xpack.installed=true, transform.node=false}] from last-known cluster state; node term 17, last-accepted version 2822 in term 17", "cluster.uuid": "_5ixLn1CQz2EdGxe85TMTQ", "node.id": "CGCD0ENwT1O_KV8snCwlCg" }

Some config:

    - name: cluster.initial_master_nodes
      value: "elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2,"
    - name: discovery.seed_hosts
      value: "elasticsearch-master-headless"
    - name: cluster.name
      value: "elasticsearch"

Master0, node Id: UIl8Fr4iSsGKjTy5DQaczg -> CGCD0ENwT1O_KV8snCwlCg
{"type": "server", "timestamp": "2020-09-23T23:39:13,686Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "ems-search-000", "node.name": "ems-search-000-master-0", "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [_q-piBceRT2bpdM4yFUxfw, CGCD0ENwT1O_KV8snCwlCg, 3D2zfFbRSqeRJivqWm7TSA], have discovered [{ems-search-000-master-0}{CGCD0ENwT1O_KV8snCwlCg}{RMoCPsq0QGy6rC2dWD44Yw}{192.168.76.3}{192.168.76.3:9300}{m}{xpack.installed=true, transform.node=false}, {ems-search-000-master-2}{D4i0wEiYQl2LGKzvfB47lQ}{BZAhYx1DRcSSLBEcnd2gqQ}{192.168.69.91}{192.168.69.91:9300}{m}{xpack.installed=true, transform.node=false}, {ems-search-000-master-1}{i02Las-eRnCAwHrR0Ej1_A}{j5XoXcCNR9iZm7EQ6IegBQ}{192.168.91.203}{192.168.91.203:9300}{m}{xpack.installed=true, transform.node=false}] which is not a quorum; discovery will continue using [192.168.69.91:9300] from hosts providers and [{ems-search-000-master-2}{3D2zfFbRSqeRJivqWm7TSA}{nYRzL6S_SqSmn634Q-dv9A}{192.168.82.66}{192.168.82.66:9300}{m}{xpack.installed=true, transform.node=false}, {ems-search-000-master-0}{CGCD0ENwT1O_KV8snCwlCg}{RMoCPsq0QGy6rC2dWD44Yw}{192.168.76.3}{192.168.76.3:9300}{m}{xpack.installed=true, transform.node=false}] from last-known cluster state; node term 17, last-accepted version 2822 in term 17", "cluster.uuid": "_5ixLn1CQz2EdGxe85TMTQ", "node.id": "CGCD0ENwT1O_KV8snCwlCg" }

Master1, node ID: _q-piBceRT2bpdM4yFUxfw -> i02Las-eRnCAwHrR0Ej1_A

{"type": "server", "timestamp": "2020-09-23T23:39:22,274Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "ems-search-000", "node.name": "ems-search-000-master-1", "message": "master not discovered or elected yet, an election requires 2 nodes with ids [i02Las-eRnCAwHrR0Ej1_A, CGCD0ENwT1O_KV8snCwlCg], have discovered [{ems-search-000-master-1}{i02Las-eRnCAwHrR0Ej1_A}{j5XoXcCNR9iZm7EQ6IegBQ}{192.168.91.203}{192.168.91.203:9300}{m}{xpack.installed=true, transform.node=false},{ems-search-000-master-2}{D4i0wEiYQl2LGKzvfB47lQ}{BZAhYx1DRcSSLBEcnd2gqQ}{192.168.69.91}{192.168.69.91:9300}{m}{xpack.installed=true, transform.node=false}, {ems-search-000-master-0}{CGCD0ENwT1O_KV8snCwlCg}{RMoCPsq0QGy6rC2dWD44Yw}{192.168.76.3}{192.168.76.3:9300}{m}{xpack.installed=true, transform.node=false}] which is a quorum; discovery willcontinue using [192.168.69.91:9300, 192.168.76.3:9300] from hosts providers and [{ems-search-000-master-1}{i02Las-eRnCAwHrR0Ej1_A}{j5XoXcCNR9iZm7EQ6IegBQ}{192.168.91.203}{174.100.91.203:9300}{m}{xpack.installed=true, transform.node=false}] from last-known cluster state; node term 0, last-accepted version 0 in term 0"}

Master2, node ID: 3D2zfFbRSqeRJivqWm7TSA -> D4i0wEiYQl2LGKzvfB47lQ

{"type": "server", "timestamp": "2020-09-23T23:36:40,475Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "ems-search-000", "node.name": "ems-search-000-master-2", "message": "master not discovered or elected yet, an election requires 2 nodes with ids [D4i0wEiYQl2LGKzvfB47lQ, CGCD0ENwT1O_KV8snCwlCg], have discovered [{ems-search-000-master-2}{D4i0wEiYQl2LGKzvfB47lQ}{BZAhYx1DRcSSLBEcnd2gqQ}{192.168.69.91}{192.168.69.91:9300}{m}{xpack.installed=true, transform.node=false}, {ems-search-000-master-0}{CGCD0ENwT1O_KV8snCwlCg}{RMoCPsq0QGy6rC2dWD44Yw}{192.168.76.3}{192.168.76.3:9300}{m}{xpack.installed=true, transform.node=false}] which is a quorum; discovery will continue using [192.168.76.3:9300] from hosts providers and [{ems-search-000-master-2}{D4i0wEiYQl2LGKzvfB47lQ}{BZAhYx1DRcSSLBEcnd2gqQ}{192.168.69.91}{192.168.69.91:9300}{m}{xpack.installed=true, transform.node=false}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }

Election IDs for each master node:
master0: [_q-piBceRT2bpdM4yFUxfw, CGCD0ENwT1O_KV8snCwlCg, 3D2zfFbRSqeRJivqWm7TSA]
master1: [i02Las-eRnCAwHrR0Ej1_A, CGCD0ENwT1O_KV8snCwlCg]
master2: [D4i0wEiYQl2LGKzvfB47lQ, CGCD0ENwT1O_KV8snCwlCg]

From the logs I found that the node IDs in each master's election requirement are not the same; Master0 still requires the pre-restart IDs, and the "master not discovered or elected yet" message kept printing until I deleted all the pods and started new ones. Why does this happen, and how can I avoid it happening again?
I also found that master0 and master2 did not receive 'added master1' messages, and master0 and master1 did not receive 'added master2' messages. How can I solve this?

Are you using persistent storage? If so, it's possible the data directories are not getting cleaned up...

I don't use persistent storage. I am using Helm and k8s, and do not mount a volume on the master nodes.

Your Config

From the log

{"type": "server", "timestamp": "2020-09-23T23:39:13,686Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "ems-search-000", "node.name": "ems-search-000-master-0","message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from

Note

"node.name": "ems-search-000-master-0"

Please read this carefully.
https://www.elastic.co/guide/en/elasticsearch/reference/current/discovery-settings.html

The initial master nodes should be identified by their node.name, which defaults to their hostname. Make sure that the value in cluster.initial_master_nodes matches the node.name exactly. If you use a fully-qualified domain name such as master-node-a.example.com for your node names then you must use the fully-qualified name in this list; conversely if node.name is a bare hostname without any trailing qualifiers then you must also omit the trailing qualifiers in cluster.initial_master_nodes.

So yours should be something like the following; also make sure your hostnames are correct:

    - name: cluster.initial_master_nodes
      value: "ems-search-000-master-0,ems-search-000-master-1,ems-search-000-master-2,"
    - name: discovery.seed_hosts
      value: "elasticsearch-master-headless"
    - name: cluster.name
      value: "elasticsearch"

Thanks, I will try it.

I double-checked the config and found I had pasted the wrong one.
The actual config is:

        - name: cluster.initial_master_nodes
          value: ems-search-000-master-0,ems-search-000-master-1,ems-search-000-master-2,
        - name: discovery.seed_hosts
          value: ems-search-000-master-headless
        - name: cluster.name
          value: ems-search-000
        - name: network.host
          value: 0.0.0.0

Did that work?

The settings work well after deleting all pods and starting a new cluster.
But I can't reproduce this issue, and I'm not sure what caused it.
From the logs I found that some nodes missed other nodes' 'added' messages, and that 2 nodes restarted at the same time.
As in the log below, master-0 discovered all three nodes, but it would only elect a master from [_q-piBceRT2bpdM4yFUxfw, CGCD0ENwT1O_KV8snCwlCg, 3D2zfFbRSqeRJivqWm7TSA]. It used the last-known cluster state (_q-piBceRT2bpdM4yFUxfw, 3D2zfFbRSqeRJivqWm7TSA), while the new IDs are (D4i0wEiYQl2LGKzvfB47lQ, i02Las-eRnCAwHrR0Ej1_A). Why didn't it update the ID info even though the node names are the same? Is there any way to solve this?

{"type": "server", "timestamp": "2020-09-23T23:59:53,775Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "ems-search-000", "node.name": "ems-search-000-master-0", "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [_q-piBceRT2bpdM4yFUxfw, CGCD0ENwT1O_KV8snCwlCg, 3D2zfFbRSqeRJivqWm7TSA], have discovered [{ems-search-000-master-0}{CGCD0ENwT1O_KV8snCwlCg}{RMoCPsq0QGy6rC2dWD44Yw}{174.100.76.3}{174.100.76.3:9300}{m}{xpack.installed=true, transform.node=false}, {ems-search-000-master-2}{D4i0wEiYQl2LGKzvfB47lQ}{BZAhYx1DRcSSLBEcnd2gqQ}{174.100.69.91}{174.100.69.91:9300}{m}{xpack.installed=true, transform.node=false}, {ems-search-000-master-1}{i02Las-eRnCAwHrR0Ej1_A}{j5XoXcCNR9iZm7EQ6IegBQ}{174.100.91.203}{174.100.91.203:9300}{m}{xpack.installed=true, transform.node=false}] which is not a quorum; discovery will continue using [174.100.91.203:9300, 174.100.69.91:9300] from hosts providers and [{ems-search-000-master-2}{3D2zfFbRSqeRJivqWm7TSA}{nYRzL6S_SqSmn634Q-dv9A}{174.100.82.66}{174.100.82.66:9300}{m}{xpack.installed=true, transform.node=false}, {ems-search-000-master-0}{CGCD0ENwT1O_KV8snCwlCg}{RMoCPsq0QGy6rC2dWD44Yw}{174.100.76.3}{174.100.76.3:9300}{m}{xpack.installed=true, transform.node=false}] from last-known cluster state; node term 17, last-accepted version 2822 in term 17", "cluster.uuid": "_5ixLn1CQz2EdGxe85TMTQ", "node.id": "CGCD0ENwT1O_KV8snCwlCg" }

Apologies, I see from above... you are not using persistent storage.

I am not sure what the issue is...

It works? I am unclear on the issue you are trying to solve... Is the cluster working, or are you just concerned about the log messages? The cluster / nodes will emit many logs until all nodes are up, a master is elected, and a quorum is established.

Sorry about the confusing description.
We are using the Helm chart from here: https://github.com/elastic/helm-charts/tree/master/elasticsearch.
The settings worked fine at the beginning, but then an unexpected issue happened.
As described in the first comment, the cluster went down after running for several days.
All 5 pods were running on 5 different instances. The k8s cluster rescheduled the 3 master nodes to 3 different instances, and during this process master2 restarted twice. I also found that master0 and master2 did not receive 'added master1' messages, and master0 and master1 did not receive 'added master2' messages. In the end we got the error "master not discovered or elected yet" constantly; the details are described in the first comment. The Elasticsearch cluster could not elect a master node for several hours. We solved this by shutting down all master nodes and starting new ones.
But I need to find out what caused this and how to avoid it happening again.

Hi, I am building a 5-node cluster on k8s: 3 master nodes and 2 data nodes. When the 3 masters were restarted one after another, I got the error "master not discovered or elected yet" and the cluster stopped working. This situation lasted several hours, until I deleted all the master pods and rebuilt them.

Curious if you have considered using our operator.

We haven't tried this. At first we were using the Amazon Elasticsearch service, but it has many operational limitations, so we wanted to deploy it ourselves.
I think this issue is not caused by the deployment; it might be caused by the master election process.
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-quorums.html

To be sure that the cluster remains available you must not stop half or more of the nodes in the voting configuration at the same time.

We had 2 master nodes offline at the same time. Will it not recover after the 2 nodes come back?

I am not sure I understand the issue. I agree with @dadoonet: try Elastic Cloud or perhaps try the operator.

BTW did you look at Cloud by Elastic, also available if needed from the AWS Marketplace?

Cloud by Elastic is one way to have access to all features, all managed by us. Think about what is already there, like Security, Monitoring, Reporting, SQL, Canvas, Maps UI, Alerting, and the built-in solutions named Observability, Security, Enterprise Search, and what is coming next :slight_smile: ...

Our ops team uses Helm to manage k8s; using Elastic Cloud would need more investigation.
But I still think that deploying with k8s vs. Elastic Cloud doesn't make any difference for this issue. I think the election has some issue: it can't update the node info. I will try to reproduce it. It would be best if you could give some advice.

It will as long as you bring back the same two nodes. You can't use two completely new nodes instead, but that's what your log message indicates has happened:

an election requires at least 2 nodes with ids from [_q-piBceRT2bpdM4yFUxfw, CGCD0ENwT1O_KV8snCwlCg, 3D2zfFbRSqeRJivqWm7TSA], have discovered [{ems-search-000-master-0}{CGCD0ENwT1O_KV8snCwlCg}, {ems-search-000-master-2}{D4i0wEiYQl2LGKzvfB47lQ}{BZAhYx1DRcSSLBEcnd2gqQ}, {ems-search-000-master-1}{i02Las-eRnCAwHrR0Ej1_A}]

Note that ems-search-000-master-2 and ems-search-000-master-1 have brand-new node IDs, indicating that they're brand-new nodes.

The reason is that the cluster metadata is stored on a majority of the master-eligible nodes, so it might not be there on ems-search-000-master-0. Since the other two nodes are new, they don't contain the cluster metadata either. As Elasticsearch cannot find a good copy of the cluster metadata any more, it won't form a cluster.

The 2 nodes went offline due to an unexpected issue, maybe a hardware issue.
Is there a way to recover the cluster? Clear the cluster metadata? Or how can I bring back the same two nodes?
If I start more than 3 master nodes, maybe 5, and then shut down 2 of them, will the master election work normally?

No, if you permanently lost a majority of the master nodes then you have lost the cluster metadata; there's no way to recover it. A single hardware issue won't affect multiple nodes at once, so you should be safe from that.
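
In Kubernetes you can also limit voluntary disruptions (drains, rescheduling) so that at most one master is down at a time. A minimal PodDisruptionBudget sketch, assuming the master pods carry a label such as app: ems-search-000-master (the label your chart actually applies may differ, and the chart may already create a PDB for you):

    apiVersion: policy/v1beta1   # use policy/v1 on Kubernetes 1.21+
    kind: PodDisruptionBudget
    metadata:
      name: ems-search-000-master-pdb
    spec:
      maxUnavailable: 1
      selector:
        matchLabels:
          app: ems-search-000-master

Note that this only protects against voluntary evictions; it does not help with node crashes or hardware failures.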

For this case, is there no way to recover?
Our procedure was to remove all master nodes at the same time and start new ones. This causes the cluster UUID to change, and then the data nodes can't join the cluster, so finally we need to rebuild the index. Is there a better way to solve this?

All master and data nodes require persistent storage so that nodes can be restarted without losing data. Without it you are in trouble, as at least in newer versions you cannot replace nodes the way you are describing without losing all the data. I have not used the Helm chart, but I would hope it enforces persistent storage (see the values sketch after the quote below).

As stated in the docs:

All master-eligible nodes, including voting-only nodes, are on the critical path for publishing cluster state updates. Because of this, these nodes require reasonably fast persistent storage and a reliable, low-latency network connection to the rest of the cluster. If you add a tiebreaker node in a third independent zone then you must make sure it has adequate resources and good connectivity to the rest of the cluster.
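
With the elastic/helm-charts chart linked above, enabling persistent storage for each node group is roughly a matter of the following values. This is a minimal sketch; the exact keys and a sensible storage size depend on your chart version, so check its values.yaml:

    # values.yaml for each node group (masters and data nodes)
    persistence:
      enabled: true

    volumeClaimTemplate:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 30Gi   # example size, adjust to your data

With a PersistentVolume behind each master, a rescheduled pod comes back with its old data path, keeps its node ID, and still counts towards the quorum in the last-committed voting configuration.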