One recurring problem is that the error messages carry very little information when reasons like "NODE_LEFT" / "no_valid_shard_copy" show up while running something like _cluster/reroute?retry_failed.
What procedure should we follow when a node has left a cluster with 100+ nodes and we don't know which node last held the missing shards? We had a major issue yesterday during a power loss: half the shards were missing, retry didn't work, and our health was around 49% because of the missing shards.
I'm sure the shard data was still on the node when it rejoined, but something happened between the node leaving and rejoining that left the cluster unable to find it. The error messages would be much more helpful if they said something like "Shard xyz last known location: node-47". With automatic rebalancing turned on, shards move between nodes constantly, so when shards are missing after a major power outage we have no idea which tools to use to find out where a shard was last located.
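To make the "last known location" idea concrete, here is a minimal sketch of a workaround: record where each shard lives while the cluster is healthy, so the information is still around after an outage. It assumes the `_cat/shards` API returns JSON rows with `index`, `shard`, `prirep`, `state` and `node` fields when called with `format=json`, and that a relocating shard shows its node as `source -> ... target`. The `base_url`, the function names and the key format are made up for illustration.

```python
import json
import urllib.request


def fetch_cat_shards(base_url="http://localhost:9200"):
    """Fetch one row per shard copy. base_url is a placeholder, not our real host."""
    url = base_url + "/_cat/shards?format=json&h=index,shard,prirep,state,node"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def update_last_known(last_known, rows):
    """Record the node that each assigned shard copy currently lives on.

    Rows without a node (unassigned shards) are skipped, so they keep
    whatever location was recorded on an earlier run.
    """
    for row in rows:
        node = row.get("node")
        if not node:
            continue
        # Assumed format for a relocating shard: "source -> ... target".
        # Keep the source, since that is where the data still is.
        node = node.split(" -> ")[0]
        # Hypothetical key format. With more than one replica per shard,
        # a real tool would need to keep a list of nodes per key.
        key = f'{row["index"]}/{row["shard"]}/{row["prirep"]}'
        last_known[key] = node
    return last_known
```

Run on a schedule, with the dictionary saved to a JSON file between runs, something like this would at least tell us which node to look at first when a shard goes missing.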
I checked the Elasticsearch website and the forums and found various posts pointing to different scripts that help with recovery operations. Actual recovery guides seem to be lacking, though (or I'm just not aware they exist / can't locate them), and they would be really beneficial when a cluster has a major issue.
Here are some questions:
- Is there a centralized place on your website that covers recovery, including scripts that aid in recovering and finding data even when some of it is missing or damaged? A partial recovery can be a lot better than no recovery.
- Are there any tools that help with recovery operations? For instance, a tool that scans all nodes for shard data that exists on disk but wasn't loaded into the cluster for whatever reason? Again, this would be hugely beneficial.
- What would cause a node to disconnect and then reconnect, with Elasticsearch still reporting node_left and saying the data is not available? All nodes eventually reconnected, but we lost ~51% of all shards to this issue. (Granted, we were running the cluster with replicas disabled because the project initially lacked funds, and that's definitely on us.)
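To illustrate the kind of scan the second question is asking about, here is a rough sketch of a report that lists which nodes hold an on-disk copy of each shard. It assumes the response of the index shard stores API (`GET /_shard_stores?status=red`) has the documented shape: `indices` → index name → `shards` → shard number → `stores`, where each store entry has a node-id key whose value contains the node `name`, plus an `allocation` field. The function name is made up, and whether this API actually covers the situation we hit is an open question.

```python
def nodes_with_shard_copies(shard_stores_response):
    """Map "index/shard" to a list of (node_name, allocation) for on-disk copies.

    An empty list means no node reported a copy of that shard.
    """
    found = {}
    indices = shard_stores_response.get("indices", {})
    for index, index_info in indices.items():
        for shard, shard_info in index_info.get("shards", {}).items():
            copies = []
            for store in shard_info.get("stores", []):
                allocation = store.get("allocation", "unknown")
                # The node is keyed by its id, which is not known in advance,
                # so look for the nested object that carries a node name.
                for value in store.values():
                    if isinstance(value, dict) and "name" in value:
                        copies.append((value["name"], allocation))
            found[f"{index}/{shard}"] = copies
    return found
```

If the output showed a node still holding a copy of a shard that the cluster reports as unassigned, that would at least narrow down where to start looking.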
To move forward, I will be spending the next two weeks redoing things from scratch and enabling replicas (at least one replica). I also want to be prepared for shard failures in the future, though, and the lack of tools (or my inability to find them) leaves me feeling a bit anxious about the next data loss event.
Thanks again for all your help! Love the product -- I just hope we can find solid tools to aid in data recovery.