Unfortunately, due to the circumstances of where I work I can't be too specific or share screenshots, but hopefully I can get you guys in the ballpark. We have a cluster made up of 7 data nodes (Dell R640s), a web server that runs Kibana as well as other apps (Dell R640), and 3 Logstash servers which I believe are R640s as well. All of our indices are set to 3 primary shards and 1 replica each. I can get you a breakdown of CPUs, RAM, and number of drives if that will help as well.
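For reference, the index settings I'm describing amount to something like this (the index name is just a placeholder):

# every index is created with 3 primaries and 1 replica
PUT example-index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}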
Recently, one of our data nodes died as the result of a motherboard failure. We are pretty sure that we haven't had any data loss, but not positive. We had 175 unassigned shards by the time the issue was discovered. Is it possible there was data loss?
As we were responding to an immediate issue I am sure we did something wrong, but we brought another node online quickly. We were hoping that the cluster would rebalance automatically for us, but that didn't happen. I am currently dropping the indices with unassigned shards to 0 replicas and then setting them back to 1 replica, and all of the new shards are being automatically assigned to the new node, which is good.
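To be concrete, the replica bounce I'm doing per index is roughly this (the index name is just an example):

# drop the replicas for an affected index
PUT affected-index/_settings
{
  "index": { "number_of_replicas": 0 }
}

# then add them back so fresh replica copies get allocated
PUT affected-index/_settings
{
  "index": { "number_of_replicas": 1 }
}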
That brings me to my questions:
How come the unassigned shards weren't automatically assigned when we brought the new node online? And how come the existing shards didn't rebalance to put the cluster in a healthier posture?
Is there any data on the failed node that we need to try to recover? We are discussing getting rid of the compromised node and just bringing a new server online in its place. If we do this, will there be unintentional data loss? My understanding is that, thanks to the replicas on the other nodes, we should be safe.
By default both of these things would have happened, but there are quite a few ways you could configure things not to. The cluster allocation explain API will tell you why shards are or aren't assigned as they are.
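In its simplest form it picks the first unassigned shard it finds and explains why it is where it is (nothing cluster-specific needed):

# explain the first unassigned shard the cluster finds
GET _cluster/allocation/explain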
Is your cluster health red? If it's yellow then you're safe, you still have a good copy of every shard.
This is unnecessary and increases your risk of data loss since it deletes some of the good shard copies too.
We are getting the response "cannot allocate because allocation is not permitted to any of the nodes". I think this happened originally when we went from 7 to 6 nodes due to the failure. We were already resource constrained before the failure, so when the node went down the cluster assigned as many shards as it could to the remaining nodes. Then the cluster stayed in a red state (where it still is) because there was literally no space left on the remaining nodes. What is strange is that the shards aren't being allocated to the new node now that it is online.
Cluster state is red.
What would be a better solution? I was thinking about doing a cold restart, but that seems like a pretty intense solution for a problem that probably has a more graceful answer.
That's the summary, but the details will explain why allocation is not permitted anywhere. EDIT: In particular, you need to look for the reason why unassigned shards are not permitted to be assigned to the brand-new node.
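You can also ask about one specific shard and then read the node-by-node decisions in the response, including the entry for the new node; something like this, where the index name, shard number and primary flag are just examples:

# explain one specific shard; the response lists a decision per node
GET _cluster/allocation/explain
{
  "index": "affected-index",
  "shard": 0,
  "primary": false
}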
In which case you've lost all copies of at least one shard (which is strange if every shard had a replica and you only lost one node). Probably simplest to restore any lost indices from a recent snapshot.
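Assuming you have a snapshot repository registered (the repository, snapshot and index names below are placeholders), a targeted restore looks roughly like this; note the existing broken index has to be deleted or closed first:

# restore just the lost index from a recent snapshot
POST _snapshot/my_repository/my_recent_snapshot/_restore
{
  "indices": "lost-index"
}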
Work out why shards are not being allocated as you expect (see above) and address those reasons.
I am reading your thread and I have some issues understanding your cluster layout. You have 3 primary shards with 1 replica, and you have 7 data nodes of which 1 died with a failed motherboard.
Statistically you have lost either 1 primary or 1 replica, unless you had multiple shards from the same index assigned to the same host.
My guess would be that you have a full disk somewhere in your cluster and that's why your cluster is red. With the setup you describe you should be OK with losing a node.
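A quick way to check is the cat allocation API, which shows shard counts and disk usage per data node:

# shard count, disk used, and disk percent for each data node
GET _cat/allocation?v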
Your earlier response got me thinking (which is dangerous). I tried the reroute API and got an error back. Part of the error message told me to do POST _cluster/reroute?retry_failed. I fired that off, and I am not sure what it changed within the cluster, but I am seeing a lot of movement now. A number of replicas that I haven't specifically addressed are now in the shard activity queue. I am hoping that is good news.
I ran the reroute command on one of the replica shards that showed up when I ran the allocation explain API. The reroute API was accepted and the shard is now in the shard activity queue as well. I am cautiously optimistic at the moment.
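For reference, the manual reroute I ran looked roughly like this (the index, shard number and node name here are placeholders for the real ones from the allocation explain output):

# explicitly allocate one unassigned replica to a specific node
POST _cluster/reroute
{
  "commands": [
    {
      "allocate_replica": {
        "index": "affected-index",
        "shard": 1,
        "node": "new-data-node"
      }
    }
  ]
}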
I am hoping that I will be able to run down the primary shards that got lost or affected by the loss of that node. I am also confused as to how we lost a primary, but hopefully some research sheds some light here.
Thanks!
I also set cluster.routing.allocation.enable: all just in case that got affected somehow.
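That was just a dynamic cluster settings update along these lines (I'm showing it as a persistent setting, though transient would work the same way):

# make sure shard allocation is enabled for all shard types
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}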
Sounds good to me. If the message said to try POST _cluster/reroute?retry_failed then that means it had made several attempts to allocate the shard, all of which ended in failure. That's also rather unusual, and unless you've addressed the reason why those allocation attempts were failing it might fail again. But it might succeed too; we can't really say without seeing the details.
That is my understanding of the situation as well. I am still pretty confused as to how we are in a red state. We were definitely constrained as far as disk space is concerned, and we have been advocating for at least one more node for months now. As I am typing this I am remembering seeing something about watermark issues, which is why there were so many unassigned shards. The remaining six nodes all hit their watermarks and no longer accepted the recently displaced shards.
This still doesn't make sense as to why the cluster is red rather than yellow. Thankfully the shards are now allocating automatically and the cluster seems to be balancing itself well.
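For anyone following along, the disk watermark thresholds that drive this behaviour can be checked with something like this (then look for the cluster.routing.allocation.disk.watermark entries in the output):

# show every cluster setting in effect, including the defaults
GET _cluster/settings?include_defaults=true&flat_settings=true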
If you had disk space issues prior to losing the node, it is possible that some shards were already in a yellow state at the time of the failure, which would mean no replica existed. Any primary shard allocated to the failed node without a corresponding replica would then be lost, and its data with it.
So things are slowly coming back online, and Christian, I think you are right. Right now we only have one index that is red. The issue seems to be twofold. First, we have rebalancing issues: we have watermark issues on the six original nodes. Second, the seventh node that we just brought online can only handle 10 incoming shards at once; as it hit that number it threw a throttling error, which was nested under the unable-to-balance error.
It then moved to the next node in the cluster to try to put the index there. That node was having watermark issues, and each following node threw "worse balance".
It's a bummer we are getting proved right this way, but we are just plain resource constrained on our cluster. We are going to get the broken node back and add an additional node to the cluster. Once that happens I think we will be in a lot better shape, or we will at least be more resilient if something like this should happen again.
Though the last question I have:
Why didn't shards start going to the new node as soon as we brought it online? Things only seem to have started flowing once we ran the reroute retry API. Was that all we had to do all along, because the unassigned shards originally couldn't be assigned due to disk space issues on the remaining nodes? In other words, once we brought the new node online we just had to instruct the cluster to retry the failed shards, and once it got that instruction and had room on the new node, things started flowing?
Hmm. What version are you running? The default throttle is 2 shards at once and it's normally a good idea to leave this setting alone (but adjust indices.recovery.max_bytes_per_sec if your cluster can handle the extra load).
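For the record, raising the recovery bandwidth is a dynamic cluster settings update along these lines (the value here is only an illustration; pick one your disks and network can actually sustain):

# let each node spend more bandwidth on shard recoveries (example value)
PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}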
It had already repeatedly tried, and failed, to allocate those shards, and eventually just marked them as bad to avoid wasting time on further attempts. Why were those allocation attempts failing though?
I am not going to mess with the throttling at this point as things are working, and I don't want to add any strain to the system as there isn't a huge rush to fix this immediately. The pace the shards are being assigned now is more than enough.
Those allocation attempts were stopped because of circuit breaker issues if my memory serves. I know I saw circuit breaker and high water mark issues. I stupidly didn't save anything off, as things were a little tense when it all happened.