Cannot restore snapshot, process already running

Hello,

We're having some trouble with our Elasticsearch cluster. The cluster has 10 nodes running Debian Jessie with Elasticsearch 2.3.4.

I'm trying to restore an index by running the following command on one of the 10 nodes:

curl -XPOST "http://localhost:9200/_snapshot/es_backup/es-backup-sno/_restore" -d '{"indices": "index-20160718"}'

The command returns this error:
{"error":{"root_cause":[{"type":"concurrent_snapshot_execution_exception","reason":"[es_backup:es-backup-sno] Restore process is already running in this cluster"}],"type":"concurrent_snapshot_execution_exception","reason":"[es_backup:es-backup-sno] Restore process is already running in this cluster"},"status":503}
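For anyone scripting restores, the blocked condition can be detected from the response body rather than by eye. This is a minimal Python sketch, assuming the error response has the shape shown above; the function name is_restore_blocked is hypothetical and not part of any Elasticsearch client.

```python
import json


def is_restore_blocked(response_body: str) -> bool:
    """Return True if a _restore response says a restore is already running.

    Assumes the error layout shown in the thread:
    {"error": {"type": "concurrent_snapshot_execution_exception", ...}, "status": 503}
    """
    try:
        body = json.loads(response_body)
    except ValueError:
        # Not JSON at all, so it cannot be the error we are looking for.
        return False
    error = body.get("error") if isinstance(body, dict) else None
    if not isinstance(error, dict):
        return False
    return error.get("type") == "concurrent_snapshot_execution_exception"
```

A script could call this on the output of the curl command above and stop retrying when it returns True, since retrying does not clear the stuck restore.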

It looks like a restore is already running. We think an old restore is still running, using a snapshot that no longer exists (it was removed in the past) on indices that no longer exist (also removed in the past).

The command curl -s 'http://localhost:9200/_cluster/state' | jq '.restore' returns a restore that uses the nonexistent snapshot on the nonexistent indices (yes, it's kind of a mess...).

es-backup-20160708 is the old snapshot, the old indices are index-201605*, and the shards are in the FAILURE state.

{ "snapshots": [ { "snapshot": "es-backup-20160708", "repository": "es_backup", "state": "STARTED", "indices": [ ... "shards": [ { "index": "index-20160527", "shard": 2, "state": "FAILURE" },
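When the restore section is long, a short summary per restore entry is easier to read than the raw JSON. The following Python sketch assumes the section has the fields visible in the excerpt above (snapshots, snapshot, repository, state, shards); the function name summarize_restores is hypothetical.

```python
from collections import Counter


def summarize_restores(restore_section: dict) -> list:
    """Summarize each restore entry of the cluster state's "restore" section.

    For every entry, report the snapshot, repository, overall state,
    and how many shards are in each shard state.
    """
    summaries = []
    for entry in restore_section.get("snapshots", []):
        shard_states = Counter(s.get("state") for s in entry.get("shards", []))
        summaries.append({
            "snapshot": entry.get("snapshot"),
            "repository": entry.get("repository"),
            "state": entry.get("state"),
            "shard_states": dict(shard_states),
        })
    return summaries
```

The input would be the parsed output of the jq '.restore' command quoted earlier. An entry whose overall state is STARTED while its shards are in FAILURE matches the situation described here.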

We don't know how to kill this running restore. Is there a way to do that?

Thanks,

Have you tried deleting the es-backup-20160708 snapshot?

The es-backup-20160708 snapshot has already been removed. When I run curl -XDELETE 'http://localhost:9200/_snapshot/es_backup/es-backup-20160708', I get:

{"error":{"root_cause":[{"type":"snapshot_missing_exception","reason":"[es_backup:es-backup-20160708] is missing"}],"type":"snapshot_missing_exception","reason":"[es_backup:es-backup-20160708] is missing","caused_by":{"type":"no_such_file_exception","reason":"/var/backup/elasticsearch/es_backup/snap-es-backup-20160708.dat"}},"status":404}

Is the index index-20160527 still part of the cluster state (i.e. _cat/indices)?

The index is not listed in the cluster state.
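This comparison can also be scripted: collect the index names referenced by the restore section and subtract the names the cluster actually lists. The Python sketch below assumes the restore section has the shape shown earlier in the thread and that the existing index names were gathered separately (for example from _cat/indices); the function name ghost_indices is hypothetical.

```python
def ghost_indices(restore_section: dict, existing_indices) -> list:
    """Return indices referenced by a restore but absent from the cluster.

    restore_section: the parsed "restore" section of the cluster state.
    existing_indices: iterable of index names the cluster currently lists.
    """
    existing = set(existing_indices)
    referenced = {
        shard["index"]
        for entry in restore_section.get("snapshots", [])
        for shard in entry.get("shards", [])
        if "index" in shard
    }
    return sorted(referenced - existing)
```

A non-empty result means the restore entry points at indices that are gone, which is the inconsistency described in this thread.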

This looks like a bug. Can you provide some additional information that can help us figure out why this happened?

  • What kind of repository type did you use?
  • Were there any other failures while restoring the snapshot? For example node crashes?

The only way to unblock the cluster for future restores is to do a full-cluster restart.

Just some more thoughts: What happens if you explicitly delete the index:

curl -XDELETE 'http://localhost:9200/index-20160527'

Also, could you send me the full cluster state (private message if it contains confidential information)?

The backup repository is an NFS share.
There were no other failures during the restore.

We already tried removing all the "ghost" indices, with no success.

I'll send you the full cluster state if I can.

We found the solution: stopping all the nodes of the cluster at the same time (we had done that before, but not at the exact same time).
After the full stop, we started all the nodes and the stuck restore process was gone!
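The difference from a rolling restart is the ordering: no node is started until every node is confirmed down. The Python sketch below illustrates only that ordering. The callables stop_node, is_running and start_node are hypothetical placeholders for whatever is used to control the service on each host (for example SSH plus the init system); nothing here is an Elasticsearch API.

```python
import time


def full_cluster_restart(nodes, stop_node, is_running, start_node,
                         poll_seconds=1.0, timeout_seconds=300.0):
    """Stop every node, wait until all are down, then start them again.

    stop_node, is_running and start_node are caller-supplied callables that
    take a node name. Raises TimeoutError, without starting anything, if some
    node is still running after timeout_seconds.
    """
    nodes = list(nodes)

    # Phase 1: ask every node to stop.
    for node in nodes:
        stop_node(node)

    # Phase 2: wait until the whole cluster is down at the same time.
    deadline = time.monotonic() + timeout_seconds
    while any(is_running(node) for node in nodes):
        if time.monotonic() > deadline:
            raise TimeoutError("some nodes are still running; not starting any node")
        time.sleep(poll_seconds)

    # Phase 3: only now bring the nodes back.
    for node in nodes:
        start_node(node)
```

The wait in phase 2 is the part that matters for the behaviour reported above: an earlier attempt where the nodes were not all down together did not clear the stuck restore.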

Thanks for your help!

I've opened an issue on our GitHub repo: https://github.com/elastic/elasticsearch/issues/19774

@snoir just to validate our assumptions on the ticket, could you provide me with logs from around the time the restore was started? In particular, we are looking for events such as deleted indices, restarted nodes and changed masters.