Recovering from a clusterFormationFailure

I'm on version 7.1.1

Background: I was trimming the cluster down from 11 to 5 nodes. I made sure all data was moved to the 5 remaining nodes, then took the empty nodes out.
But I might have made a mistake when I changed the setting
"discovery.zen.minimum_master_nodes": "6"
to
"discovery.zen.minimum_master_nodes": "3"

I was trying to do that when there were only 5 nodes left, and I think the master had been elected on one of the nodes I was removing.

Now I'm having trouble restarting the cluster:

[o.e.c.c.ClusterFormationFailureHelper] [ip-192-168-22-26] master not discovered or elected yet, an election requires at least 6 nodes with ids from
[fOP1wlD-SgqzJcw3LozPmg, kKWKmKfGRKCBoi-H3L4hLw, dj1O8xKfRhi9tcu4SRNE5A, KlhOzJs1TWaefm09yFNx_A, Qj8gfl-_SkyYvxy74WKSgg, bMEQFfqBQvabUmKwAav8_w, XLgq5xajRo2pO3SueUMR6A, _zlsYdxOQzmfR4pVY4Z7sA, rc_i49WTT9K8YGjkVzuC6g, 5oZHQoTrSgmj8m_CvOuhQg, GO4ORcIJSqaJMRmcl2FFEw],
have discovered
[{ip-192-168-12-243}{ifaOHYGgRbSuSc9lKtTWEg}{W0byVuJLRLGLkLEpuFGnmw}{192.168.12.243}{192.168.12.243:9300}{aws_availability_zone=us-gov-west-1a},
 {ip-192-168-22-28}{_zlsYdxOQzmfR4pVY4Z7sA}{fr4EKHGESH-V171OnO1VDw}{192.168.22.28}{192.168.22.28:9300}{aws_availability_zone=us-gov-west-1b},
 {ip-192-168-12-208}{kKWKmKfGRKCBoi-H3L4hLw}{-g8LwZ2EQg-dRTTah50OsQ}{192.168.12.208}{192.168.12.208:9300}{aws_availability_zone=us-gov-west-1a},
 {ip-192-168-12-157}{5oZHQoTrSgmj8m_CvOuhQg}{syB0AxzUSP6lEqFN5arv6w}{192.168.12.157}{192.168.12.157:9300}{aws_availability_zone=us-gov-west-1a},
 {ip-192-168-22-54}{KlhOzJs1TWaefm09yFNx_A}{0QKxei6VT4K4gCGjFUXLug}{192.168.22.54}{192.168.22.54:9300}{aws_availability_zone=us-gov-west-1b}]
which is not a quorum; discovery will continue using [127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 192.168.12.157:9300, 192.168.22.54:9300, 192.168.22.26:9300, 192.168.22.28:9300, 192.168.12.243:9300, 192.168.12.208:9300] from hosts providers and [{ip-192-168-22-26}{dj1O8xKfRhi9tcu4SRNE5A}{oI0Jc8zdTpit6OlAcOOcEw}{192.168.22.26}{192.168.22.26:9300}{aws_availability_zone=us-gov-west-1b}] from last-known cluster state; node term 733, last-accepted version 63839 in term 730

Nodes are added via an auto scaling group, and discovery uses the AWS EC2 discovery plugin.

How can I tell the cluster to just accept 5 nodes (and not require 6 for the master election)?

This setting is ignored in 7.x. The issue is that you have removed more than half of the master-eligible nodes all at once, which is not supported since it means you may have lost data: the latest cluster state might only be on the 6 nodes you removed.

The only safe way to proceed is to restore this cluster from a recent snapshot.
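In case it helps, here is a minimal sketch of what the restore could look like once you have a working cluster again, assuming the repository-s3 plugin is installed on every node; the repository name, snapshot name and bucket below (my_s3_repo, snapshot_yesterday, my-backup-bucket) are placeholders to replace with your own:

PUT /_snapshot/my_s3_repo HTTP/1.1
Content-Type: application/json

{
  "type": "s3",
  "settings": {
    "bucket": "my-backup-bucket"
  }
}

POST /_snapshot/my_s3_repo/snapshot_yesterday/_restore HTTP/1.1
Content-Type: application/json

{
  "indices": "*",
  "include_global_state": false
}

You can list the snapshots in the repository first with GET /_snapshot/my_s3_repo/_all to find the exact snapshot name.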

Thanks, David.

Indeed, some old 6.x habits here. So what are the steps to restart the 5 nodes?
I'm OK with losing all the data; I took a backup to S3 yesterday.

BTW: I'm maintaining both a 6.5.* cluster and a 7.* cluster.
It would be nice to get an error when we set 'obsolete' parameters; not sure how easy that is to do. (I'm sure not everybody works with the latest release on a daily basis.) For reference, this is the request I ran:

PUT /_cluster/settings HTTP/1.1
Content-Type: application/json

{
  "transient": {
    "discovery.zen.minimum_master_nodes": "3"
  }
}

Just thinking about this a bit more.
I had all the shards moved to my 5 nodes, and the 6 nodes I was removing held no data, so no 'data' was lost. It would be nice if there were a way to tell the ClusterFormationFailureHelper to ignore certain state it keeps internally.

The nodes you removed were master-eligible, so although they had no shards they held the metadata needed to correctly interpret the data held in your shards. That metadata is stored redundantly so the cluster can tolerate the loss of a minority of its master-eligible nodes, but it cannot tolerate the loss of more than half. Without that metadata you can get into some very strange data loss situations indeed. Best to start again: wipe all the nodes and start up a brand-new cluster.
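To make the "start again" part concrete, a minimal elasticsearch.yml sketch for bootstrapping the fresh 7.x cluster, assuming the discovery-ec2 plugin stays in place; the cluster name and node names below are placeholders for your five remaining nodes, and the old contents of path.data must be wiped on every node before starting:

cluster.name: my-new-cluster        # placeholder
node.name: node-name-1              # this host's own name; set per node
discovery.seed_providers: ec2       # keep using EC2 discovery
cluster.initial_master_nodes:       # node.name of all five remaining nodes
  - node-name-1
  - node-name-2
  - node-name-3
  - node-name-4
  - node-name-5

cluster.initial_master_nodes is only needed for the very first startup of the brand-new cluster and should be removed from the configuration once the cluster has formed. For future scale-downs, removing master-eligible nodes one at a time (or using the voting configuration exclusions API to take them out of the voting configuration before shutting them down) avoids ending up in this situation.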

Elasticsearch already emits warnings when you set deprecated parameters, both in the Warning response header and in the deprecation log.
