and when it tries to restore indexes it throws an error:
> ERROR Failed to complete action: restore. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: Unable to obtain recovery information for specified indices. Error: NotFoundError(404, u'index_not_found_exception', u'no such index')
When I try to restore to the same environment, it restores without any errors.
Can you please help me?
On cluster 2, have you created the repository full_backup which uses the exact same data store as cluster 1? Presumably created with the exact same API call?
What do you see on cluster 2 when you run:
GET /_snapshot
Does it look exactly like what you see on cluster 1 when you run the same thing?
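If it helps, here is one quick way to compare the two: a minimal sketch using the Python Elasticsearch client (the cluster URLs are placeholders).

```python
from elasticsearch import Elasticsearch

# Placeholder URLs -- point these at your two clusters.
cluster1 = Elasticsearch("http://cluster1.example.com:9200")
cluster2 = Elasticsearch("http://cluster2.example.com:9200")

# Equivalent to GET /_snapshot on each cluster: every registered
# repository and its settings. The full_backup entries should match.
print(cluster1.snapshot.get_repository())
print(cluster2.snapshot.get_repository())
```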
This log message, and the subsequent ones, indicate that Curator had been waiting for the restore to complete for 169 seconds (just under 3 minutes), and then, 9 seconds later, a very unexpected log message appears. After the check ran successfully every 9 seconds for 181 seconds (which is evident in the log messages and timestamps), this API call suddenly fails.
Curator is just trying to collect the status of recovering indices, specifically the ones in index_list, and one or more of them is missing, resulting in the 404 error you are seeing.
The thing is, index_list doesn't change over successive iterations. The restore_check function gets called by the wait_for_it function every n seconds until the restore is complete. Completeness is determined by checking that every recovering shard of each index specified in index_list is in the DONE state. If a single shard in any index in index_list does not report DONE, then restore_check returns False to wait_for_it, which waits n seconds and repeats the restore_check call, until it either returns True or max_wait is reached. The exact same index_list data is sent every time; no alterations or omissions are made for indices that have already been restored.
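To make that concrete, here is a minimal sketch of the behaviour described above (an illustration, not Curator's actual source). It assumes the standard Python Elasticsearch client and the usual _recovery response shape, where each index reports a list of shards, each with a "stage" field; the 9-second interval matches what is visible in your log.

```python
import time

from elasticsearch import Elasticsearch

client = Elasticsearch("http://cluster2.example.com:9200")  # placeholder URL


def restore_check(client, index_list):
    """Return True only when every shard of every index in index_list reports stage DONE."""
    # GET /<index1>,<index2>,.../_recovery -- if any index in index_list no longer
    # exists, this call fails with a 404 index_not_found_exception, which is
    # exactly the error quoted above.
    response = client.indices.recovery(index=",".join(index_list))
    for index in index_list:
        for shard in response[index]["shards"]:
            if shard["stage"] != "DONE":
                return False
    return True


def wait_for_it(client, index_list, wait_interval=9, max_wait=-1):
    """Poll restore_check every wait_interval seconds until it returns True or max_wait elapses."""
    elapsed = 0
    while max_wait < 0 or elapsed <= max_wait:
        if restore_check(client, index_list):
            return
        time.sleep(wait_interval)
        elapsed += wait_interval
    raise RuntimeError("max_wait reached before the restore completed")
```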
In your case, the function is working flawlessly—repeatedly, even—and then stops because the index that it's trying to restore is no longer there (the 404 error). If it were a problem with the cluster, I would expect it to fail on the very first check of the recovery state. But in your case, it's clearly running just fine for around 3 minutes before it suddenly fails. Without more insight and understanding here, I'm forced to guess what's going on.
So, is there some other process that alters the indices/shards while the restore is going on? That's the first thought that comes to mind.
No, there are no other processes that could be using the indexes. As you can see in my actionfile, I'm closing all indexes before restoring them.
I've changed include_global_state: True to False and it did restore the indexes, but out of 52M records it restored only 20M, and I don't know why... Maybe I should change this parameter in the backup actionfile as well?
All that include_global_state does with snapshots (and restores) is include any templates, and any persistent cluster settings. I would take a look at the snapshot state.
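For example, something like this shows the state and any shard failures recorded for the snapshot (a sketch with the Python client; the snapshot name and URL are placeholders, so use the snapshot referenced by your restore action):

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://cluster1.example.com:9200")  # placeholder URL

# Equivalent to GET /_snapshot/full_backup/<snapshot_name>.
info = client.snapshot.get(repository="full_backup", snapshot="snapshot-20180101")
for snap in info["snapshots"]:
    # A state of PARTIAL, or any entries under "failures" / a non-zero
    # "shards.failed" count, would explain why only part of the data came back.
    print(snap["snapshot"], snap["state"], snap.get("failures", []), snap.get("shards", {}))
```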