Index Shard Restore Failed Exception When Restoring From Snapshot

zylo47 · October 6, 2015, 7:24pm

Hello,

We currently use ES 1.3.2 on CentOS 6.5. I am using Python 3 and the Elasticsearch Python API module. Our environments consist of 3-node clusters. Most of the indices are just a single shard replicated 2 or 3 times.

I am creating a process to restore Elasticsearch indices from a snapshot. I am able to create the snapshots with no problem, but when I restore using the API I encounter the following error in certain situations:

[2015-10-06 14:43:21,560][DEBUG][cluster.service ] [MYHOST] processing [shard-failed ([jobs-404-01.06.15.17.48.01][0], node[TkB6by2dTNmOj8mAMDO1WQ], [P], restoring[elasticsearch_snapshots:snapshot_20151006142357], s[INITIALIZING]), reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[jobs-404-01.06.15.17.48.01][0] failed recovery]; nested: IndexShardRestoreFailedException[[jobs-404-01.06.15.17.48.01][0] restore failed]; nested: IndexShardRestoreFailedException[[jobs-404-01.06.15.17.48.01][0] failed to restore snapshot [snapshot_20151006142357]]; nested: IndexShardRestoreFailedException[[jobs-404-01.06.15.17.48.01][0] failed to read shard snapshot file]; nested: FileNotFoundException[/mnt/elasticsearch_snapshots/indices/jobs-404-01.06.15.17.48.01/0/snapshot-snapshot_20151006142357 (No such file or directory)]; ]]]: done applying updated cluster_state (version: 6019)

When this error happens, it repeats continuously unless I stop the cluster and delete the cluster state on the target cluster. I then have to re-add the snapshot repository.

I noticed that for this particular index, the error indicates that it's trying to find the data in /mnt/elasticsearch_snapshots/indices/jobs-404-01.06.15.17.48.01/0/snapshot-snapshot_20151006142357, but that directory is empty. I can see in the snapshot repository that other indices, which restored successfully, do have data in the /0 sub-folder.

When I create the snapshot, this is the code snippet that I'm using (include_global_state is False and indices_str is a comma-delimited list of the indices that I'm including in the snapshot):

...
        body = '{{"indices": "{}", "include_global_state": "{}"}}'.format(indices_str, str(include_global_state))

    try:
        print('Creating snapshot {}'.format(snapshot))
        self._snapshot_client.create(repository=self._snapshot_repository,
                                     snapshot=snapshot,
                                     body=body,
                                     wait_for_completion=False,
                                     request_timeout=60)
...

When I try to restore the snapshot, I use this code, restoring one index at a time (ignore_unavailable is True and include_global_state is False):

...
    body = '{{"indices": "{}", "include_global_state": "{}", "ignore_unavailable": "{}"}}'.format(index, str(include_global_state), str(ignore_unavailable))
    self._snapshot_client.restore(repository=self._snapshot_repository,
                                  snapshot=snapshot,
                                  body=body,
                                  wait_for_completion=True,
                                  request_timeout=600)
...
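
For reference, the same request bodies can also be built from a dict and serialized with json.dumps, which emits real JSON booleans (false) instead of the quoted strings ("False") that the str() formatting in the snippets produces. This is only a sketch: the helper names build_snapshot_body and build_restore_body are hypothetical, and whether ES 1.3.2 treats the quoted strings any differently is an open assumption, not something confirmed here.

```python
import json


def build_snapshot_body(indices, include_global_state=False):
    """Return the JSON body for a snapshot create request.

    `indices` is a list of index names; it is joined into the
    comma-delimited string the snapshot API expects.
    """
    return json.dumps({
        "indices": ",".join(indices),
        "include_global_state": include_global_state,
    })


def build_restore_body(index, include_global_state=False,
                       ignore_unavailable=True):
    """Return the JSON body for restoring a single index."""
    return json.dumps({
        "indices": index,
        "include_global_state": include_global_state,
        "ignore_unavailable": ignore_unavailable,
    })
```

The resulting strings can be passed as body= to the same create and restore calls shown in the snippets.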

When I look at the snapshot it says "shards":{"total":206,"failed":0,"successful":206}. I don't see any reason why that directory would have no data in it.

Is this a bug with 1.3.2 or is it possible I'm using some incorrect settings when making the snapshot?

Any help is appreciated. Let me know what other information I can provide to help troubleshoot the issue.

edit - This definitely appears to be an issue where the snapshot is not creating files in the /0 sub-folder. I need to know if this is a bug or if this is an issue with the way I'm generating the snapshot. Do indices need to be closed before I run the snapshot? Is there a way to verify the snapshot integrity after it's been created?

edit 2 - I did some more digging and I'm able to reproduce the issue. I found one index that existed in the source cluster and was listed in the snapshot, but no data was written to the snapshot folder for that index. I just need to know if this is a bug in version 1.3.2 or if something else is causing this. The snapshot status says that there were 0 failures and that all of the indices were snapshotted successfully.
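
A check along these lines can list which indices are missing shard data in a shared filesystem repository before a restore is attempted. It is a minimal sketch that assumes the repository layout seen in the error message (indices/<index>/<shard>/snapshot-<snapshot name>); the function name find_missing_shard_snapshots and its parameters are hypothetical, and it only tests for the presence of the shard snapshot file, not the integrity of the segment data.

```python
import os


def find_missing_shard_snapshots(repo_path, snapshot_name, indices,
                                 shards_per_index=1):
    """Return (index, shard) pairs with no shard snapshot file on disk.

    Assumes the filesystem repository layout
    <repo_path>/indices/<index>/<shard>/snapshot-<snapshot_name>.
    """
    missing = []
    for index in indices:
        for shard in range(shards_per_index):
            path = os.path.join(repo_path, "indices", index, str(shard),
                                "snapshot-" + snapshot_name)
            if not os.path.isfile(path):
                missing.append((index, shard))
    return missing


if __name__ == "__main__":
    # Values taken from the log message; adjust for the real repository.
    for index, shard in find_missing_shard_snapshots(
            "/mnt/elasticsearch_snapshots",
            "snapshot_20151006142357",
            ["jobs-404-01.06.15.17.48.01"]):
        print("Missing shard snapshot file: {} shard {}".format(index, shard))
```

Running this right after a snapshot completes would flag any index whose /0 sub-folder is empty even though the snapshot status reports no failures.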

edit 3 - The issue is reproducible on a cluster under load. It's not reproducible on a cluster with no load. This leads me to believe that there's some issue with the snapshot interface in this version when the server is under load.
