Hi there,
Unfortunately, due to the circumstances of where I work I can't be too specific or share screenshots, but hopefully I can get you in the ballpark. We have a cluster made up of 7 data nodes (Dell R640s), a web server that runs Kibana as well as other apps (also an R640), and 3 Logstash servers, which I believe are R640s as well. All of our indices are set to 3 primary shards and 1 replica each. I can get you a breakdown of CPUs, RAM, and number of drives if that would help.
Recently, one of our data nodes died as the result of a motherboard failure. We are pretty sure we haven't had any data loss, but we're not positive. By the time the issue was discovered, we had 175 unassigned shards. Is it possible there was data loss?
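For context, this is roughly how we've been checking the unassigned shard count. It's just a sketch: the host URL is a placeholder and auth is omitted, so it's not exactly what we run.

```python
# Quick sketch of how we check cluster status and unassigned shards.
# The host URL is a placeholder and authentication is omitted.
import requests

ES = "http://localhost:9200"  # placeholder host

# Overall cluster status plus the unassigned shard counter.
health = requests.get(f"{ES}/_cluster/health").json()
print(health["status"], health["unassigned_shards"])

# List the individual unassigned shards and which index they belong to.
shards = requests.get(
    f"{ES}/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,state,unassigned.reason"},
).json()
for s in shards:
    if s["state"] == "UNASSIGNED":
        print(s["index"], s["shard"], s["prirep"], s["unassigned.reason"])
```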
Since we were responding to an immediate issue I'm sure we did something wrong, but we brought another node online quickly. We were hoping the cluster would rebalance automatically, but that didn't happen. I am currently setting the indices with unassigned shards to 0 replicas and then back to 1 replica, and all of the new replica shards are being automatically assigned to the new node, which is good.
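In case it matters, here's roughly the replica flip I'm doing, one index at a time through the index settings API. The host and index name below are placeholders.

```python
# Rough sketch of the per-index replica flip I'm doing.
# Host and index name are placeholders, not our real names.
import requests

ES = "http://localhost:9200"   # placeholder host
index = "my-index-000001"      # placeholder index name

# Drop replicas so the unassigned replica copies go away...
requests.put(
    f"{ES}/{index}/_settings",
    json={"index": {"number_of_replicas": 0}},
)

# ...then bump back to 1 so fresh replicas get allocated
# (in our case they are landing on the new node).
requests.put(
    f"{ES}/{index}/_settings",
    json={"index": {"number_of_replicas": 1}},
)
```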
That brings me to my questions:
- Why weren't the unassigned shards automatically assigned when we brought the new node online? And why didn't the existing shards rebalance so the cluster would be in a healthier posture? (I've put a sketch of the allocation-explain call I can run at the end of this post, in case that output would help.)
- Is there any data on the failed node that we would want to recover? We are discussing getting rid of the compromised node and just bringing a new server online in its place. If we do this, will there be unintentional data loss? My understanding is that the replicas on the other nodes should keep us safe.
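Regarding the first question: I can run something like the sketch below against the cluster allocation explain API and share sanitized output if that would help. The host, index name, and shard number are placeholders.

```python
# Sketch of pulling the allocation explanation for one of the
# unassigned shards, so I can share (sanitized) why it wasn't assigned.
# Host, index name, and shard number are placeholders.
import requests

ES = "http://localhost:9200"  # placeholder host

resp = requests.get(
    f"{ES}/_cluster/allocation/explain",
    json={
        "index": "my-index-000001",  # placeholder index
        "shard": 0,                  # placeholder shard number
        "primary": False,            # explaining a replica copy
    },
)
explain = resp.json()
print(explain.get("unassigned_info", {}).get("reason"))
print(explain.get("allocate_explanation"))
```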
Thanks so much for any help!

Alex