Restoring indexes with curator


(Dmytro Kulyk) #1

Hi, I have a problem with restoring indexes from one cluster to another.
I'm using curator tool with actionfile:

>     actions:
>       1:
>         action: close
>         options:
>           delete_aliases: False
>           disable_action: False
>           ignore_empty_list: True
>         filters:
>         - filtertype: pattern
>           kind: regex
>           value: '(.*audit.*|.*debug.*|.*erp.*)'
>       
>       2:
>         action: restore
>         options:
>           repository: full_backup
>           name: full_backup
>           indices:
>           wait_interval: 10
>           max_wait: 3600
>           include_aliases: False
>           ignore_unavailable: True
>           include_global_state: True
>           partial: False
>           wait_for_completion: True
>           skip_repo_fs_check: True
>           continue_if_exception: True
>           disable_action: False
>         filters:
>         - filtertype: state
>           state: SUCCESS

and when it tries to restore indexes it throughs an error:

> ERROR Failed to complete action: restore. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: Unable to obtain recovery information for specified indices. Error: NotFoundError(404, u'index_not_found_exception', u'no such index')

when I'm trying to restore to the same env - it restores without any errors.
Can you please help me?


(Aaron Mildenstein) #2

On cluster 2, have you created the repository full_backup which uses the exact same data store as cluster 1? Ostensibly created with the exact same API call?

What do you see on cluster 2 when you run:

GET /_snapshot

Does it look exactly like what you see on cluster 1 when you run the same thing?


(Dmytro Kulyk) #3

Yes, I did repository. It is the same folder on a shared server and it is mapped to both clusters.

On cluster 2 I see:

{
"full_backup": {
"type": "fs",
"settings": {
"compress": "true",
"location": "/mnt/elk_full_backup"
}
}
}


(Aaron Mildenstein) #4

So, what is the output of:

GET /_snapshot/_status

on each cluster?


(Dmytro Kulyk) #5

on both clusters:

"snapshots": []

But using ElasticHQ and checking snapshots on each cluster I can see this snapshot on both of them.

Also with doing

GET /_cat/snapshots/full_backup

I have

full_backup SUCCESS 1543587225 15:13:45

On both clusters.


(Dmytro Kulyk) #6

In Debug mode I have such log:

INFO               curator.utils          restore_check:1616 Index "index_name" is still in stage "INDEX"
2018-12-03 11:30:18,593 DEBUG              curator.utils            wait_for_it:1754 Response: False
2018-12-03 11:30:18,593 DEBUG              curator.utils            wait_for_it:1774 Action "restore" not yet complete, 169 total seconds elapsed. Waiting 9 seconds before checking again.
2018-12-03 11:30:27,598 DEBUG              curator.utils            wait_for_it:1751 Elapsed time: 181 seconds
2018-12-03 11:30:27,799 ERROR                curator.cli                    run:186  Failed to complete action: restore.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: Unable to obtain recovery information for specified indices. Error: NotFoundError(404, u'index_not_found_exception', u'no such index')

what can it be? can you please help me?
Maybe you need more information? Just tell me what info you need.


(Aaron Mildenstein) #7

This log message, and the subsequent ones, indicate that Curator has been trying to restore for 169 seconds (just under 3 minutes), and suddenly, 9 seconds later, a very unexpected log message comes. After successfully running every 9 seconds for 181 seconds (which is evident in the log messages and time stamps), suddenly, this API call fails:

    response = client.indices.recovery(index=to_csv(index_list), human=True)

Curator is just trying to collect the status of recovering indices, specifically the ones in index_list, and one or more of them is missing, resulting in the 404 error you are seeing.

The thing is, index_list doesn't change over successive iterations. The restore_check function gets called by the wait_for_it function every n seconds until the restore is complete. Completeness is determined by checking each index specified in index_list having all of its recovering shards in a DONE state. If a single shard in any index in index_list does not report DONE, then the restore_check function returns False to the wait_for_it function, which then waits n seconds, and repeats the restore_check calls until it either returns True, or max_wait is reached. The exact same index_list data is sent every time. No alterations or omissions for indices already restored is performed.

In your case, the function is working flawlessly—repeatedly, even—and then stops because the index that it's trying to restore is no longer there (the 404 error). If it were a problem with the cluster, I would expect it to fail on the very first check of the recovery state. But in your case, it's clearly running just fine for around 3 minutes before it suddenly fails. Without more insight and understanding here, I'm forced to guess what's going on.

So, is there some other process that alters the indices/shards while the restore is going on? That's the first thought that comes to mind.


(Dmytro Kulyk) #8

No, there is no any other processes that can use indexes. As you can see in my actionfile, I'm closing all indexes before restoring them.

I've changed include_global_state: True to False and it did restore indexes, but from 52m records it did only 20m and I don't know why... Maybe I should change this parameter in backup actionfile also?


(Aaron Mildenstein) #9

All that include_global_state does with snapshots (and restores) is include any templates, and any persistent cluster settings. I would take a look at the snapshot state.


(Dmytro Kulyk) #10

As you can see, state of snapshot is "Success"

Also, with

POST /_snapshot/full_backup/_verify

I can see that both of clusters see that snapshot and all 6 nodes (3 in each cluster) are shows it
Cluster1:

{
"nodes": {
"kZjywCuhTUiEUVRjGCFMPw": {
"name": "elk3"
},
"Jth2U1CeTSqvmpSiBNFAYg": {
"name": "elk1"
},
"BbGOrWrbRT6EkERSQRtaIw": {
"name": "elk2"
}
}
}

Cluster2:

{
"nodes": {
"6HV9p3ZTQ56_YCYcgHu5Hg": {
"name": "elk1"
},
"MYXtf2LiTTOCAi-rvu21QA": {
"name": "elk3"
},
"qgpSoNgERN25M8sTkfD2TA": {
"name": "elk2"
}
}
}


(Aaron Mildenstein) #11

How is this count determined? What does _cat/indices show on each server?