Hello,
We currently use ES 1.3.2 on CentOS 6.5. I am using Python 3 and the Elasticsearch Python API module. Our environments consist of 3-node clusters. Most of the indices are just a single shard replicated 2 or 3 times.
I am creating a process to restore Elasticsearch indices from a snapshot. I am able to create the snapshots with no problem, but when I restore using the API I encounter an error in certain situations:
[2015-10-06 14:43:21,560][DEBUG][cluster.service ] [MYHOST] processing [shard-failed ([jobs-404-01.06.15.17.48.01][0], node[TkB6by2dTNmOj8mAMDO1WQ], [P], restoring[elasticsearch_snapshots:snapshot_20151006142357], s[INITIALIZING]), reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[jobs-404-01.06.15.17.48.01][0] failed recovery]; nested: IndexShardRestoreFailedException[[jobs-404-01.06.15.17.48.01][0] restore failed]; nested: IndexShardRestoreFailedException[[jobs-404-01.06.15.17.48.01][0] failed to restore snapshot [snapshot_20151006142357]]; nested: IndexShardRestoreFailedException[[jobs-404-01.06.15.17.48.01][0] failed to read shard snapshot file]; nested: FileNotFoundException[/mnt/elasticsearch_snapshots/indices/jobs-404-01.06.15.17.48.01/0/snapshot-snapshot_20151006142357 (No such file or directory)]; ]]]: done applying updated cluster_state (version: 6019)
When this error happens, the failed restore keeps retrying continuously unless I stop the cluster and delete the cluster state on the target cluster. I then have to re-add the snapshot repository.
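Re-adding the repository looks roughly like this (just a sketch; the repository name and location are from my setup, taken from the log above):
# Re-register the fs snapshot repository after clearing the cluster state.
# Repository name and location are specific to my environment.
self._snapshot_client.create_repository(
    repository='elasticsearch_snapshots',
    body={'type': 'fs',
          'settings': {'location': '/mnt/elasticsearch_snapshots'}})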
I noticed that for this particular index, the error indicates it's trying to find the data in /mnt/elasticsearch_snapshots/indices/jobs-404-01.06.15.17.48.01/0/snapshot-snapshot_20151006142357, but that directory is empty. I can see in the snapshot repository that other indices that restored successfully do have data in their /0 sub-folder.
When I create the snapshot, this is the code snippet that I'm using (include_global_state is False and indices_str is a comma-delimited list of the indices I'm including in the snapshot):
...
body = '{{"indices": "{}", "include_global_state": "{}"}}'.format(indices_str, str(include_global_state))
try:
    print('Creating snapshot {}'.format(snapshot))
    self._snapshot_client.create(repository=self._snapshot_repository,
                                 snapshot=snapshot,
                                 body=body,
                                 wait_for_completion=False,
                                 request_timeout=60)
...
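In case the hand-built JSON string is relevant: as far as I know, the Python client also accepts a plain dict as the body and serializes it to JSON itself, so the equivalent call would look roughly like this (with include_global_state passed as a real boolean instead of the string 'False'):
# Sketch of the same create call with a dict body instead of a formatted JSON string.
self._snapshot_client.create(repository=self._snapshot_repository,
                             snapshot=snapshot,
                             body={'indices': indices_str,
                                   'include_global_state': include_global_state},
                             wait_for_completion=False,
                             request_timeout=60)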
When I try to restore the snapshot, I am using this code, restoring one index at a time (ignore_unavailable is True and include_global_state is False):
...
body = '{{"indices": "{}", "include_global_state": "{}", "ignore_unavailable": "{}"}}'.format(index, str(include_global_state), str(ignore_unavailable))
self._snapshot_client.restore(repository=self._snapshot_repository,
                              snapshot=snapshot,
                              body=body,
                              wait_for_completion=True,
                              request_timeout=600)
...
When I look at the snapshot it says "shards":{"total":206,"failed":0,"successful":206}. I don't see any reason why that directory would have no data in it.
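(For completeness, this is roughly how I'm reading those numbers, assuming the usual get-snapshot response shape:)
# Fetch the snapshot info from the repository and print the shard counts.
resp = self._snapshot_client.get(repository=self._snapshot_repository,
                                 snapshot=snapshot)
for snap in resp['snapshots']:
    print(snap['snapshot'], snap['state'], snap['shards'])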
Is this a bug in 1.3.2, or is it possible I'm using incorrect settings when making the snapshot?
Any help is appreciated. Let me know what other information I can provide to help troubleshoot the issue.
edit - This definitely appears to be an issue where the snapshot is not writing files to the /0 sub-folder. I need to know whether this is a bug or a problem with the way I'm generating the snapshot. Do indices need to be closed before I run the snapshot? Is there a way to verify the snapshot's integrity after it's been created?
edit 2 - I did some more digging and I'm able to reproduce the issue. I found one index that existed in the source cluster and was listed in the snapshot, but no data was written to the snapshot folder for that index. I just need to know whether this is a bug in version 1.3.2 or something else is causing it. The snapshot status says there were 0 failures and that all of the indices were snapshotted successfully.
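A quick check along these lines (just a sketch, assuming the repository keeps shard data under indices/<index>/<shard>/ as in the error above, with the mount path from my setup) should find the indices with empty shard folders:
import os

# Flag single-shard index folders in the repository that contain no files.
repo_path = '/mnt/elasticsearch_snapshots'  # path from my setup
indices_root = os.path.join(repo_path, 'indices')
for index_name in sorted(os.listdir(indices_root)):
    shard_dir = os.path.join(indices_root, index_name, '0')  # shard 0 of single-shard indices
    if not os.path.isdir(shard_dir) or not os.listdir(shard_dir):
        print('No snapshot data for index {}'.format(index_name))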
edit 3 - The issue is reproducible on a cluster under load. It's not reproducible on a cluster with no load. This leads me to believe there's some issue with the snapshot process in this version when the server is under load.