Here's our situation:
Our cluster is made up of 10 VMs:
- 3 masters, one of which is also a data node
- 4 data only nodes
- 2 clients
Elasticsearch version: 1.7.0
A few days ago, I noticed there were some connectivity problems between the nodes. Two of the masters couldn't communicate with the rest of the cluster, and one master didn't even respond to ping (though it had been disconnected from the cluster even before that). On top of that, only three of the data nodes were actually connected to the cluster, together with the client nodes. With some help we were able to revive the dead master, and during that process we also reconnected the other nodes (for some reason we had to activate the masters' network.bind config, although until then they had worked just fine without it).

I then checked the cluster state, and there were many unassigned shards (more than 10,000). Given how stressed the data nodes were, and the fact that their storage was just about full (98%), I figured that was pretty natural (although a bit suspicious). So I let the cluster rebalance itself over the weekend, since it was the last day of the week anyway. But when we came back three days later and opened up our Kibana, we were horrified to find that part of our data was gone!
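For context, this is roughly how I was checking the cluster's state (the host is a placeholder for one of our client nodes; these are the standard 1.x health and cat APIs):

```shell
# Cluster-wide health summary, including the unassigned-shard count
curl -s 'localhost:9200/_cluster/health?pretty'

# List the shards currently stuck in UNASSIGNED
curl -s 'localhost:9200/_cat/shards' | grep UNASSIGNED | head

# Disk usage per data node, which is how we saw the ~98% full storage
curl -s 'localhost:9200/_cat/allocation?v'
```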
After some investigation, plus help from other people, we concluded that the previously dead master node had deployed its old cluster state...
Needless to say, all three of our master nodes have since joined the cluster and received the bad cluster state. The way I understand your docs, the other two masters were higher on the master-candidate list since they had been active more recently, but because they couldn't find enough masters (2, per our config), they gave up, and then in came the lucky dead master, who reconnected with everyone just in time...
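The "2 by our config" above refers to the standard zen quorum setting; for completeness, this is the shape of it (a sketch with a placeholder host; in 1.x this setting is also dynamically updatable via the cluster settings API):

```shell
# With 3 master-eligible nodes, quorum is 2: (3 / 2) + 1
# Equivalent to discovery.zen.minimum_master_nodes: 2 in elasticsearch.yml
curl -s -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": { "discovery.zen.minimum_master_nodes": 2 }
}'
```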
For now, we've disabled cluster routing allocation, since some nodes still disconnect from time to time.
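Concretely, we disabled allocation with the cluster settings API, along these lines (placeholder host; `transient` so it resets on a full cluster restart):

```shell
# Stop all shard allocation so the cluster doesn't keep shuffling
# shards around while nodes are flapping
curl -s -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'
```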
Right now, we're considering the dangling indices option.
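If I understand the 1.x local-gateway docs correctly, dangling-index handling is controlled per node in elasticsearch.yml, something like the following (my reading of the docs, not something we've applied yet):

```yaml
# Import dangling indices found on disk back into the cluster state
# (other values: "no" to delete them, "closed" to import them closed)
gateway.local.auto_import_dangled: yes
# How long dangling indices are kept on disk before deletion
gateway.local.dangling_timeout: 2h
```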
We also have a snapshot from two months ago, but we'd rather recover our current data if we can.
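Before falling back to that, we'd at least inspect what the snapshot actually contains, roughly like this (`my_backup` is a placeholder for our repository name):

```shell
# List all snapshots in the repository, with the indices each one covers
curl -s 'localhost:9200/_snapshot/my_backup/_all?pretty'
```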
Is there anything we can still do?