Accidentally loaded old cluster state

blackBagel · August 14, 2016, 8:14pm

Here's our situation:

Our cluster is made up of 10 VMs:

3 masters, one of which is also a data node
4 data-only nodes
2 clients

Elasticsearch version: 1.7.0

A few days ago, I noticed connectivity problems between the nodes. Two of the masters couldn't communicate with the rest of the cluster, and one master didn't even respond to ping (though it had already been disconnected from the cluster before that). On top of that, only three data nodes were actually connected to the cluster, together with the client nodes. With some help we were able to revive the dead master, and during that process we also reconnected the other nodes (for some reason we had to enable the network.bind config on the masters, although until then they had worked fine without it).

I checked the cluster state, and there were many unassigned shards (more than 10000). Given how stressed the data nodes were, and that their storage was almost full (98%), I figured that was fairly natural (if a bit suspicious). So I let the cluster rebalance itself over the weekend, since it was the last day of the week anyway. But when we came back three days later and opened Kibana, we were horrified to find that part of our data was gone!
After some investigation, with help from other people, we concluded that the previously dead master node had published its old cluster state...

Needless to say, all three of our master nodes had already joined the cluster and received the bad cluster state. The way I understand your docs, the other two masters were higher on the master candidate list since they had been active more recently, but since they didn't find enough masters (2 by our config), they gave up. Then the previously dead master came along and reconnected with everyone just in time...

For now, we have stopped cluster routing, since some nodes still disconnect from time to time.

Right now, we're considering the dangling indices option.

We also have a snapshot from two months ago, but we'd rather save our current data if we can.

Is there anything we can still do?

blackBagel · August 14, 2016, 8:17pm

There's an extra data node I forgot to count...

warkolm · August 14, 2016, 9:28pm

Do you have minimum masters set on all nodes?

What version are you on?

blackBagel · August 15, 2016, 5:53am

Unfortunately, discovery.zen.minimum_master_nodes is unset, so it defaults to 1. Our bad.
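For reference, the Elasticsearch 1.x documentation recommends setting discovery.zen.minimum_master_nodes to a quorum of the master-eligible nodes, (master_eligible_nodes / 2) + 1. With the default of 1, a single isolated master can elect itself and publish whatever cluster state it holds. A minimal Python sketch of that arithmetic for a cluster with three master-eligible nodes (the helper names are hypothetical, for illustration only):

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: a strict majority."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(visible_master_eligible: int, master_eligible: int) -> bool:
    """True if a partition sees enough master-eligible nodes to elect a master."""
    return visible_master_eligible >= minimum_master_nodes(master_eligible)


if __name__ == "__main__":
    total = 3  # three master-eligible nodes, as in this cluster
    print(minimum_master_nodes(total))  # 2
    print(can_elect_master(1, total))   # False: an isolated master cannot elect itself
    print(can_elect_master(2, total))   # True: two of three masters form a quorum
```

For three master-eligible nodes this gives 2, i.e. `discovery.zen.minimum_master_nodes: 2` in elasticsearch.yml on every node.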

As I wrote in the first message: Elasticsearch version 1.7.0, on all nodes.
