Issue with allocation filtering on snapshot restore

Hi, I have just encountered a strange issue when attempting a test restore of my production Elasticsearch cluster to a test server.

It appears I have several indices that have somehow ended up with a node-level routing allocation filter (index.routing.allocation.require._id). This is not something I have consciously set, and based on some other issues I have identified, I can only assume it is related to a previous issue with an ILM policy (I don't believe I ever targeted a specific node).
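
For anyone hitting the same thing: a rough way to spot which indices carry the filter is to ask the get settings API for just that key, something like this:

GET /_all/_settings/index.routing.allocation.require._id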

When I go to restore a snapshot (which was taken successfully), I am unable to restore these indices because the restore process cannot allocate any shards, as the required node ID does not exist in the target cluster.

Has anyone encountered this issue before? Has anyone found a workaround to avoid it? I tried setting index.routing.allocation.require._id to null on restore, but this did not seem to have any effect.

I have gone back and removed and re-assigned the ILM policy and removed the index.routing.allocation.require._id filter via the index settings API. Once I did this and created a new snapshot, I was able to restore the problem indices successfully.
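
For reference, a minimal sketch of the settings change I mean (my-index is just a placeholder for each affected index name); setting the value to null resets it:

PUT /my-index/_settings
{
  // null removes the allocation filter from the live index
  "index.routing.allocation.require._id": null
}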

Have you tried changing the settings during the restore, as below? Did that not work?

POST /_snapshot/my_repository/snapshot_1/_restore
{
  "indices": "index_1",
  "rename_pattern": "(.+)",
  "rename_replacement": "restoretest1",
  "index_settings": {
    // change some things here
    "index.number_of_replicas": 0
  }
}

If that doesn't work, you can use the following API to find out why, and then decide what to do:

GET /_cluster/allocation/explain
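
For a particular shard you can also pass a body, roughly like this (the index name is just a placeholder):

GET /_cluster/allocation/explain
{
  "index": "restoretest1",
  "shard": 0,
  "primary": true
}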

To remove a setting at restore time, use the ignore_index_settings parameter. See the snapshot restore docs for more info.
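
A rough sketch of what that might look like, reusing the repository and snapshot names from the earlier example:

POST /_snapshot/my_repository/snapshot_1/_restore
{
  "indices": "index_1",
  // drop the problem setting from the restored index
  "ignore_index_settings": [
    "index.routing.allocation.require._id"
  ]
}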

Thanks for that. I had already looked at the allocation explain API as that is how I knew it was complaining about the target/required node being unavailable.
I had also tried renaming the index (before I worked out what was happening), but that doesn't help, as it's an index setting.

Thanks for your response. I was attempting the restore via the UI (7.16.3), and that restore option is not available there. Hopefully it is available in v8 and I will see it once I complete my upgrade.

The options available are to override settings (I tried that and it didn't help) or to reset settings to their defaults, and that is not a setting it will let me specify.

I will try another restore solely using the API from one of the snapshots containing the indices with the problem setting and see if it works.

It would be good to see the actual restore error in the UI (or even the Elasticsearch log) during a restore, saying 'this index failed to restore due to missing target nodes / shard allocation failed', instead of the restore appearing successful with no errors anywhere while the indices end up open and offline with no shards allocated. The only reason I noticed these indices were not available was basic due diligence checks to make sure everything was working post-restore.
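
For context, those checks are nothing fancy; something along these lines to spot unhealthy indices after a restore:

GET _cat/indices?v&health=red
GET _cluster/health?level=indices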

I have done a final test this morning and confirmed that the workaround behaves as expected. I've also re-confirmed the failure scenario, and there are definitely no log events anywhere showing that the restore failed (the restore also does not show up in the Restore Status UI). Basically all that happens is it creates the empty index metadata with no shards, plus the allocation error. This lack of error handling and notification to the user is, in my view, a bug, and given it is directly related to the restore of snapshots it should be addressed fairly urgently; the last thing you need in a cluster failure scenario is a problematic snapshot restore experience. I haven't had a chance to verify this behaviour in v8 as I am yet to upgrade, but I can confirm it is present in 7.16.3 (restoring a snapshot taken on 7.16.2).

I think this is covered by the section of the manual entitled Monitor a restore:

You can also pass ?wait_for_completion=true to the restore API, as long as you are willing to wait for it to finish.
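
For example, something roughly like this (names reused from the earlier request), and while it runs the index recovery API shows per-shard progress:

POST /_snapshot/my_repository/snapshot_1/_restore?wait_for_completion=true
{
  "indices": "index_1"
}

// in another window, check how the restored shards are recovering
GET /index_1/_recovery?human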

These options are not new either, they are available in all non-EOL versions.

Thanks David, except there is nothing to monitor, as there is no event, no log, no nothing... If there is a UI feature for it, shouldn't it appear there? If I am driving things via the UI using standard procedures (as per Elastic's documentation), surely the Restore Status UI should show the status of the restore I have just attempted to trigger?

Unfortunately I don't have the bandwidth to do a whole detailed round of scenario testing, but to me there is an issue here: you should not have to switch between the UI and Dev Tools / API calls. If you do something in one, the data should be visible in the other.

I am quite at home using the API, so that is not a concern for me, but the UI features need improvement.

I hear you, but I'm not involved with the UI side of things so there's not a lot I can do about it. If you think you've found bugs then please report them on GitHub.

Thanks David. You have been a big help and gave me the explanation for how to get around my restore issue. I'm not sure I'm mentally up for arguing my case on GitHub at the moment (been there, done that in the past), but if I feel inclined I will do in the future :slight_smile:

Ok, no probs, I'll try and find a way to route your feedback to the right people.
