Index Shard Restore Failed Except When Restoring From Snapshot

(zylo47) #1


We currently use ES 1.3.2 on CentOS 6.5. I am using Python 3 and the Elasticsearch Python API module. Our environments consist of 3-node clusters. Most of the indices are just a single shard replicated 2 or 3 times.

I am creating a process to restore Elasticsearch indices from a snapshot. I can create the snapshots with no problem, but when I restore using the API I encounter an error in certain situations.

[2015-10-06 14:43:21,560][DEBUG][cluster.service ] [MYHOST] processing [shard-failed ([jobs-404-][0], node[TkB6by2dTNmOj8mAMDO1WQ], [P], restoring[elasticsearch_snapshots:snapshot_20151006142357], s[INITIALIZING]), reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[jobs-404-][0] failed recovery]; nested: IndexShardRestoreFailedException[[jobs-404-][0] restore failed]; nested: IndexShardRestoreFailedException[[jobs-404-][0] failed to restore snapshot [snapshot_20151006142357]]; nested: IndexShardRestoreFailedException[[jobs-404-][0] failed to read shard snapshot file]; nested: FileNotFoundException[/mnt/elasticsearch_snapshots/indices/jobs-404- (No such file or directory)]; ]]]: done applying updated cluster_state (version: 6019)

When this error happens, it repeats continuously unless I stop the cluster and delete the cluster state on the target cluster. Then I have to re-add the snapshot repository.

I noticed that for this particular index, the error indicates it's trying to find the data in /mnt/elasticsearch_snapshots/indices/jobs-404-, but that directory is empty. I can see in the snapshot repository that the indices that restored successfully have data in their /0 sub-folder.

When I create the snapshot, this is the code snippet that I'm using (include_global_state is False and indices_str is a comma-delimited list of the indices to include in the snapshot):

    body = '{{"indices": "{}", "include_global_state": "{}"}}'.format(indices_str, str(include_global_state))

    print('Creating snapshot {}'.format(snapshot))
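For reference, a fuller sketch of the create call (the `es` client variable is assumed; the repository and snapshot names below are taken from the log output above). Building the body as a dict and letting the client serialize it keeps the booleans as real JSON booleans, whereas `str(False)` produces the string "False", which the server has to guess at:

```python
import json

def build_snapshot_body(indices, include_global_state):
    """Build the snapshot request body as a dict so include_global_state
    stays a real JSON boolean instead of the string "False"/"True"."""
    return {
        "indices": ",".join(indices),
        "include_global_state": include_global_state,
    }

body = build_snapshot_body(["jobs-404-", "another-index"], False)
print(json.dumps(body))

# Actual call against the cluster (commented out here; requires a live client):
# es.snapshot.create(repository="elasticsearch_snapshots",
#                    snapshot="snapshot_20151006142357",
#                    body=body, wait_for_completion=True)
```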

When I try to restore the snapshot, I use this code, restoring one index at a time (ignore_unavailable is True and include_global_state is False):

    body = '{{"indices": "{}", "include_global_state": "{}", "ignore_unavailable": "{}"}}'.format(index, str(include_global_state), str(ignore_unavailable))
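The restore side, sketched the same way (again, `es` is an assumed elasticsearch-py client and the repository/snapshot names come from the log above; the close call reflects the requirement that a restored index be closed or absent on the target cluster):

```python
import json

def build_restore_body(index, include_global_state, ignore_unavailable):
    # Real booleans, not str(True)/str(False), so the JSON is unambiguous.
    return {
        "indices": index,
        "include_global_state": include_global_state,
        "ignore_unavailable": ignore_unavailable,
    }

body = build_restore_body("jobs-404-", False, True)
print(json.dumps(body))

# Commented out here; requires a live cluster:
# es.indices.close(index="jobs-404-")
# es.snapshot.restore(repository="elasticsearch_snapshots",
#                     snapshot="snapshot_20151006142357",
#                     body=body, wait_for_completion=True)
```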

When I look at the snapshot it says "shards":{"total":206,"failed":0,"successful":206}. I don't see any reason why that directory would have no data in it.
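This is roughly how I'm reading the shard counts out of the snapshot info (a sketch: `es` is an assumed client, and the sample response below is the one quoted above):

```python
def summarize_shards(snapshot_info):
    """Pull (total, successful, failed) shard counts out of a snapshot
    info document as returned by the snapshot API."""
    shards = snapshot_info.get("shards", {})
    return (shards.get("total", 0),
            shards.get("successful", 0),
            shards.get("failed", 0))

# Sample of what the API reports for this snapshot:
info = {"shards": {"total": 206, "failed": 0, "successful": 206}}
total, ok, failed = summarize_shards(info)
print(total, ok, failed)  # 206 206 0

# Fetched in practice with something like:
# info = es.snapshot.get(repository="elasticsearch_snapshots",
#                        snapshot="snapshot_20151006142357")["snapshots"][0]
```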

Is this a bug with 1.3.2 or is it possible I'm using some incorrect settings when making the snapshot?

Any help is appreciated. Let me know what other information I can provide to help troubleshoot the issue.

edit - This definitely appears to be an issue where the snapshot is not writing files to the /0 sub-folder. I need to know whether this is a bug or an issue with the way I'm generating the snapshot. Do indices need to be closed before I run the snapshot? Is there a way to verify snapshot integrity after it's been created?

edit 2 - I did some more digging and I'm able to reproduce the issue. I found one index that existed in the source cluster and was listed in the snapshot, but no data was written to the snapshot folder for that index. I just need to know whether this is a bug in 1.3.2 or whether something else is causing it. The snapshot status says there were 0 failures and that all of the indices were snapshotted successfully.

edit 3 - The issue is reproducible on a cluster under load, but not on an idle cluster. This leads me to believe there's some issue with the snapshot interface in this version when the server is under load.
