IndexShardRestoreFailedException When Restoring From Snapshot


(zylo47) #1

Hello,

We currently use ES 1.3.2 on CentOS 6.5. I am using Python 3 and the Elasticsearch Python API module. Our environments consist of 3-node clusters. Most of the indices are just a single shard replicated 2 or 3 times.

I am creating a process to restore Elasticsearch indices from a snapshot. I am able to create the snapshots with no problem, but when I restore using the API I encounter an error in certain situations.

[2015-10-06 14:43:21,560][DEBUG][cluster.service ] [MYHOST] processing [shard-failed ([jobs-404-01.06.15.17.48.01][0], node[TkB6by2dTNmOj8mAMDO1WQ], [P], restoring[elasticsearch_snapshots:snapshot_20151006142357], s[INITIALIZING]), reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[jobs-404-01.06.15.17.48.01][0] failed recovery]; nested: IndexShardRestoreFailedException[[jobs-404-01.06.15.17.48.01][0] restore failed]; nested: IndexShardRestoreFailedException[[jobs-404-01.06.15.17.48.01][0] failed to restore snapshot [snapshot_20151006142357]]; nested: IndexShardRestoreFailedException[[jobs-404-01.06.15.17.48.01][0] failed to read shard snapshot file]; nested: FileNotFoundException[/mnt/elasticsearch_snapshots/indices/jobs-404-01.06.15.17.48.01/0/snapshot-snapshot_20151006142357 (No such file or directory)]; ]]]: done applying updated cluster_state (version: 6019)

When this error happens, it repeats continuously until I stop the cluster and delete the cluster state on the target cluster. I then have to re-add the snapshot repository.

I noticed that for this particular index, the error indicates that it's trying to read /mnt/elasticsearch_snapshots/indices/jobs-404-01.06.15.17.48.01/0/snapshot-snapshot_20151006142357, but the /0 directory is empty. I can see in the snapshot repository that other indices that restored successfully do have data in their /0 sub-folder.

When I create the snapshot, this is the code snippet that I'm using (include_global_state is False and indices_str is a comma-delimited list of the indices that I'm including in the snapshot):

...
        body = '{{"indices": "{}", "include_global_state": "{}"}}'.format(indices_str, str(include_global_state))

    try:
        print('Creating snapshot {}'.format(snapshot))
        self._snapshot_client.create(repository=self._snapshot_repository,
                                     snapshot=snapshot,
                                     body=body,
                                     wait_for_completion=False,
                                     request_timeout=60)
...

When I try to restore the snapshot, I am using this code, restoring one index at a time (ignore_unavailable is True and include_global_state is False):

...
    body = '{{"indices": "{}", "include_global_state": "{}", "ignore_unavailable": "{}"}}'.format(index, str(include_global_state), str(ignore_unavailable))
    self._snapshot_client.restore(repository=self._snapshot_repository,
                                  snapshot=snapshot,
                                  body=body,
                                  wait_for_completion=True,
                                  request_timeout=600)
...
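One thing worth ruling out in both request bodies above: formatting a Python bool with str() and wrapping it in quotes sends the string "False" or "True" rather than a JSON boolean. Whether ES 1.3.2 interprets the string "False" as false is not something verified here, so the following is only a minimal sketch of building the same bodies with json.dumps so the flags go out as real booleans. The helper name build_snapshot_body is hypothetical and not part of the Elasticsearch client.

```python
import json


def build_snapshot_body(indices, include_global_state=False, ignore_unavailable=None):
    """Serialize a snapshot/restore request body with real JSON booleans.

    Hypothetical helper: `indices` may be a comma-delimited string or a list
    of index names. `ignore_unavailable` is only included when it is set.
    """
    if not isinstance(indices, str):
        indices = ",".join(indices)
    body = {
        "indices": indices,
        "include_global_state": include_global_state,
    }
    if ignore_unavailable is not None:
        body["ignore_unavailable"] = ignore_unavailable
    # json.dumps turns Python False/True into JSON false/true (unquoted).
    return json.dumps(body)


# Body for creating the snapshot, and body for restoring a single index.
create_body = build_snapshot_body(["index-a", "index-b"], include_global_state=False)
restore_body = build_snapshot_body("index-a", include_global_state=False,
                                   ignore_unavailable=True)
```

The resulting strings can be passed as the body argument in the create and restore calls shown above in place of the hand-formatted ones.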

When I look at the snapshot it says "shards":{"total":206,"failed":0,"successful":206}. I don't see any reason why that directory would have no data in it.

Is this a bug with 1.3.2 or is it possible I'm using some incorrect settings when making the snapshot?

Any help is appreciated. Let me know what other information I can provide to help troubleshoot the issue.

edit - This definitely appears to be an issue where the snapshot is not creating files in the /0 sub-folder. I need to know if this is a bug or if this is an issue with the way I'm generating the snapshot. Do indices need to be closed before I run the snapshot? Is there a way to verify the snapshot integrity after it's been created?
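As a stopgap for the integrity question, the missing files can at least be detected directly on the shared filesystem. This is a minimal sketch, not an official integrity check: it assumes the repository layout implied by the error message above (indices/<index>/<shard>/snapshot-<snapshot name>), which may differ in other versions, and the function name and the shards_by_index argument are hypothetical.

```python
from pathlib import Path


def find_missing_shard_files(repo_root, snapshot, shards_by_index):
    """Return (index, shard) pairs whose shard snapshot file is absent.

    Assumed layout (taken from the FileNotFoundException path):
        <repo_root>/indices/<index>/<shard>/snapshot-<snapshot>
    `shards_by_index` maps each index name to its number of primary shards.
    """
    missing = []
    for index, shard_count in shards_by_index.items():
        for shard in range(shard_count):
            shard_file = (Path(repo_root) / "indices" / index / str(shard)
                          / "snapshot-{}".format(snapshot))
            if not shard_file.is_file():
                missing.append((index, shard))
    return missing
```

For the case above it would be called with repo_root set to /mnt/elasticsearch_snapshots, snapshot set to snapshot_20151006142357, and a mapping such as {"jobs-404-01.06.15.17.48.01": 1}. A non-empty result means a restore of those indices would hit the same FileNotFoundException.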

edit 2 - I did some more digging and I'm able to reproduce the issue. I found one index that existed in the source cluster and was listed in the snapshot, but no data was written to the snapshot folder for that index. I just need to know if this is a bug in version 1.3.2 or if something else is causing this. The snapshot status says that there were 0 failures and that all of the indices were snapshotted successfully.

edit 3 - The issue is reproducible on a cluster under load. It's not reproducible on a cluster with no load. This leads me to believe that there's some issue with the snapshot interface in this version when the server is under load.

