Master not discovered error


I created a cluster containing 2 data/master nodes and 3 master-only nodes on AWS EC2.
Because of the bootstrap feature, I specified the IPs of the two data nodes in cluster.initial_master_nodes.

After the cluster started successfully, I shut all the nodes down.
When I restarted the services, I got the error message below, saying the master was not discovered.

What I don't understand is that the error log says this node must discover the two nodes with IPs ending in 54 and 249, and it also says it 'have discovered' 54 and 249. What does that mean? Thanks in advance.

[2019-09-04T18:13:06,436][WARN ][o.e.c.c.ClusterFormationFailureHelper] [265c11851f2e] master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [,] to bootstrap a cluster: have discovered [{a85f4d457006}{07GysAOyT2m_VYcuKX1f-w}{Fch5OMfzQ6-MGl8QC5Bjhg}{}{}, {cbc7d32e43be}{KKNNl2sJReuqC_Nkegop0g}{ScySSIWOTI6biCdHAOkFHg}{}{}, {532457f50a27}{OQlMkdEZQ9ecPKyORE1x5w}{-Gfo9t75QiSOdy9oxRGb4w}{}{}{aws_availability_zone=us-east-1c}, {035580c7e862}{OdgujABbRcWS-PPN3wrtoA}{OMrQ_3HlReWmBizc1y-YKA}{}{}{aws_availability_zone=us-east-1b}]; discovery will continue using [,,,,,,,,,] from hosts providers and [{265c11851f2e}{5DfCY0qiQE2GnnYZaFohbg}{-5s4Ai0nSFuTGYPSfTx8RQ}{}{}] from last-known cluster state; node term 0, last-accepted version 0 in term 0

There is something wrong with your configuration if you get this message on a restart:

this node has not previously joined a bootstrapped (v7+) cluster

From your description it sounds like these nodes had previously joined a cluster. Are you sure that you are using persistent storage for your master nodes?
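
To check the logs of several nodes for this message, a minimal Python sketch along these lines may help. It assumes the warning has the same shape as the line quoted in the question and that the first brace group of each discovered node is its name; `summarize_warning` and the shortened sample line are illustrative and not part of Elasticsearch.

```python
import re

# Phrase that indicates the node started with no persisted cluster state.
NOT_BOOTSTRAPPED = "this node has not previously joined a bootstrapped (v7+) cluster"


def summarize_warning(line):
    """Report whether the node says it never joined a cluster, and which nodes it discovered."""
    discovered = []
    match = re.search(r"have discovered \[(.*?)\];", line)
    if match:
        # Assumption: each node is rendered as {name}{id}{...}..., so the first
        # brace group at the start or after ", " is taken as the node name.
        discovered = re.findall(r"(?:^|, )\{([^}]*)\}", match.group(1))
    return {"never_bootstrapped": NOT_BOOTSTRAPPED in line, "discovered": discovered}


# Shortened from the warning in the question (two of the four discovered nodes).
sample = (
    "master not discovered yet, this node has not previously joined a "
    "bootstrapped (v7+) cluster, and this node must discover master-eligible "
    "nodes [,] to bootstrap a cluster: have discovered "
    "[{a85f4d457006}{07GysAOyT2m_VYcuKX1f-w}{Fch5OMfzQ6-MGl8QC5Bjhg}{}{}, "
    "{532457f50a27}{OQlMkdEZQ9ecPKyORE1x5w}{-Gfo9t75QiSOdy9oxRGb4w}{}{}"
    "{aws_availability_zone=us-east-1c}]; discovery will continue"
)
print(summarize_warning(sample))
```

A node that reports `never_bootstrapped` after a restart is a candidate for the missing persistent storage asked about here.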

Does it work to use the node names instead of their IP addresses?

Also, are those two nodes master-eligible? Which version are you using, exactly?

Thanks for the quick reply, David.

Yes, .54 and .249 are master-eligible.

The version is 7.1.1. (Actually, I'm using opendistro 1.1.0 from AWS.)

The master-only nodes do not use persistent storage; the data (also master) nodes (.54, .249) do.
If I start the data nodes first and then the master-only nodes after a full shutdown, it works.

The error message I was asking about appears when I start the master-only nodes first after a full shutdown. I was expecting those master-only nodes to be treated as new nodes that would discover .54/.249 once they started and then join the existing cluster. According to the log, they did discover .54 and .249, but it doesn't say why they cannot join.

OK, this doesn't work. All master-eligible nodes need persistent storage.

Since we are using AWS EC2 instances, what if some master nodes are terminated and we cannot get their storage back? Or will it be OK as long as more than half of the master-eligible nodes have persistent data?

Another question: if the whole cluster went down and the storage of all master nodes was lost, but the storage of the data-only nodes remained, is there a way to recover our data?
Thank you.

You must have persistent storage on all master-eligible nodes. Then the cluster will tolerate the loss of a minority of them.
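
The "minority" here is plain majority arithmetic, which a short Python sketch can make concrete. The function names are illustrative, and the sketch only counts nodes; it ignores any adjustments Elasticsearch itself makes to the set of voting nodes.

```python
def quorum(master_eligible):
    """Smallest majority of the given number of master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_losses(master_eligible):
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


for n in (3, 5):
    print(f"{n} master-eligible nodes: quorum {quorum(n)}, tolerates losing {tolerated_losses(n)}")
```

So a set of five master-eligible nodes tolerates losing two, and a set of three tolerates losing one.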

Yes, you can restore a snapshot into a new cluster.

I worked out why you were having the original problem, by the way. The node whose logs you shared isn't one of the nodes listed in cluster.initial_master_nodes, so it cannot trigger the initial election (such nodes can trigger other elections after the initial one, but we're not there yet). However, the nodes that are listed in cluster.initial_master_nodes were failing to perform an election for some other reason, which would have been described in their logs.

It is strange to have two data-and-master nodes and three master-only nodes. It is unusual to want five master-eligible nodes in your cluster. I think you should either have three dedicated master nodes and two data-only nodes, or else two data-and-master nodes and one extra master-only node. If you were using 7.3 and weren't using OpenDistro then you could make the extra master-only node a voting-only master node to ensure that it never actually becomes the master, meaning it would need less CPU and heap.

Thanks for the explanation, that's helpful.
