Elasticsearch recovery tools

One recurring problem is that error messages carry very little information when reasons such as "NODE_LEFT" / "no_valid_shard_copy" show up, for example after running _cluster/reroute?retry_failed.

What procedure should we follow when a node has left a cluster of 100+ nodes and we don't know which node last held the missing shards? We had a major issue yesterday during a power loss: half the shards were missing, retrying didn't work, and our health was around 49% due to the missing shards.

I'm sure the shards were available on the node once it rejoined, but something happened between the moment the node left and the moment it rejoined, and the cluster could not find the shard data. The error messages would be much more helpful if they stated something like "Shard xyz last known location: node-47". With auto balancing turned on, shards can constantly move between nodes, so when shards are missing after a major power outage we have no clue what tools to use to figure out where a shard was last located.
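One possible starting point for the "last known location" question is the cluster allocation explain API (`_cluster/allocation/explain`), which reports why a given shard is unassigned. The Python below is only a rough sketch: it assumes that, for a NODE_LEFT shard, the `unassigned_info.details` field contains text of the form `node_left [<node id>]` (this format is an assumption and may vary between versions). The names `base_url`, `fetch_allocation_explain` and `last_node_id` are made up for illustration.

```python
import json
import re
from typing import Optional
from urllib import request


def fetch_allocation_explain(base_url: str, index: str, shard: int,
                             primary: bool = True) -> dict:
    """Ask the cluster why one specific shard copy is unassigned.

    base_url is a placeholder such as "http://localhost:9200".
    """
    body = json.dumps({"index": index, "shard": shard, "primary": primary})
    req = request.Request(
        f"{base_url}/_cluster/allocation/explain",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# Assumed shape of the details text: "node_left [<node id>]".
_NODE_LEFT = re.compile(r"node_left\s*\[([^\]]+)\]")


def last_node_id(explain: dict) -> Optional[str]:
    """Return the ID of the node that left, or None if it is not reported."""
    info = explain.get("unassigned_info", {})
    if info.get("reason") != "NODE_LEFT":
        return None
    match = _NODE_LEFT.search(info.get("details", ""))
    return match.group(1) if match else None
```

Calling `last_node_id(fetch_allocation_explain(base_url, "my-index", 0))` would then give a node ID to search for in the logs, provided the response carries one.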

I checked the Elasticsearch website and the forums and found various posts pointing to different scripts that help with recovery operations. Proper recovery guides seem to be lacking, though (or I'm just not aware they exist / can't locate them), and they would be really beneficial when a cluster has a major issue.

Here are some questions:

  1. Is there a centralized place on your website that covers recovery, scripts that aid in recovery, and finding data even if some of it goes missing or is damaged? Partial recovery can be a lot better than no recovery.

  2. Are there any tools that help with recovery operations? For instance, a tool that will scan all nodes for shard data that exists but wasn't loaded into the cluster for various reasons? Again, this would be hugely beneficial.

  3. What would cause a node to disconnect and then reconnect, yet have Elasticsearch still report node_left and say the data is not available? We saw all nodes eventually reconnect but we lost ~51% of all shards to this issue. (Granted, we were running the cluster with replicas disabled due to an initial lack of funds for the project, and that's definitely on us.)
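To get an overview of how many shards are missing and why, one option is the `_cat/shards` API with the `unassigned.reason` column. The sketch below is an illustration rather than an official recovery tool: it requests JSON output and counts unassigned shards per reason. The function names and `base_url` are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, List
from urllib import request


def fetch_shards(base_url: str) -> List[dict]:
    """Fetch one row per shard copy, including the unassigned reason column."""
    url = (f"{base_url}/_cat/shards"
           "?format=json&h=index,shard,prirep,state,unassigned.reason")
    with request.urlopen(url) as resp:
        return json.load(resp)


def unassigned_by_reason(rows: List[dict]) -> Dict[str, int]:
    """Count UNASSIGNED shard copies, grouped by their reported reason."""
    counts: Counter = Counter()
    for row in rows:
        if row.get("state") == "UNASSIGNED":
            counts[row.get("unassigned.reason") or "unknown"] += 1
    return dict(counts)


def unassigned_primaries(rows: List[dict]) -> List[str]:
    """List "index/shard" for every unassigned primary (prirep == "p")."""
    return [
        f'{row["index"]}/{row["shard"]}'
        for row in rows
        if row.get("state") == "UNASSIGNED" and row.get("prirep") == "p"
    ]
```

Each entry returned by `unassigned_primaries` is a shard for which no copy is currently assigned, so those are the ones worth investigating first.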

To move forward, I will be spending the next two weeks redoing things from scratch and enabling replicas (at least one set of replicas). I want to be prepared for future shard failures, though, and the lack of tools (or my inability to find them) leaves me feeling a bit anxious about the next data integrity loss event.

Thanks again for all your help! Love the product -- I just hope we can find solid tools to aid in data recovery.

I was able to see a hex value for the node in the retry output. We then scanned the log files for that hex value and did indeed find a node that had started but whose Elasticsearch service had not -- so we're on our way to recovery!

One issue though -- instead of just showing the hex value, why not show the last known name of the node as well? It would save the time spent cross-referencing the hex value to find the real node name. Unless there is a node / cluster / cat command that does this?
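For nodes that are currently part of the cluster, `_cat/nodes` can list IDs alongside names (`h=id,name`, with `full_id=true` to get the untruncated ID). A node that has not rejoined will not appear there, so this only helps once it is back or when the mapping is saved ahead of time. A minimal sketch, with hypothetical function names and `base_url`:

```python
import json
from typing import Dict, List, Optional
from urllib import request


def fetch_nodes(base_url: str) -> List[dict]:
    """Fetch the full node ID and node name of every node in the cluster."""
    url = f"{base_url}/_cat/nodes?format=json&full_id=true&h=id,name"
    with request.urlopen(url) as resp:
        return json.load(resp)


def node_names_by_id(rows: List[dict]) -> Dict[str, str]:
    """Build a lookup table from node ID to node name."""
    return {row["id"]: row["name"] for row in rows}


def resolve_node_name(node_id: str, rows: List[dict]) -> Optional[str]:
    """Return the node name for an ID seen in an error message, if known."""
    return node_names_by_id(rows).get(node_id)
```

Saving the output of `node_names_by_id(fetch_nodes(base_url))` periodically while the cluster is healthy would give a table to consult when a node ID turns up in an error after an outage.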

Thanks!

This is really two questions. Elasticsearch itself looks after finding the missing shard data when nodes return to the cluster, but you need to address the fact that there are nodes missing from the cluster. Knowing which nodes held which shards doesn't really matter for that task.

If you're running a fairly recent version then you can follow the troubleshooting guides in the manual:
