How to take snapshots in cluster


(Animageofmine) #1

I have a simple (and perhaps stupid) question, since I couldn't find a direct answer (except here).

We have a cluster of 8 servers, all dockerized with the data volume mounted on the host (/var/lib/elasticsearch), and we want to set up snapshots. A couple of quick questions:

  1. Do we take snapshot from each node in the cluster?
  2. If answer to 1 is yes, do we create a separate snapshot (different name) for each node?

E.g.
Following are the nodes: esnode1-5, esmaster1-3

Following is the snapshot request we would run from each node in the cluster, where <nodename_date> could be data1_12162016, data2_12162016, and so on.

curl -XPUT 'http://localhost:9200/_snapshot/elasticsearch/<nodename_date>?wait_for_completion=true' -d '{
    "ignore_unavailable": "true",
    "include_global_state": false
}'

When we restore, we would run the following request:

curl -XPOST 'localhost:9200/_snapshot/elasticsearch/<nodename_date>/_restore' -d '{
    "ignore_unavailable": "true",
    "include_global_state": false
}'

If the answer to question 1 is no, please let me know how snapshots are taken for a dockerized cluster, along with an example if possible. Let me know if you need more information, and thank you in advance.


(Mark Walkom) #2

A snapshot is a cluster-level action, i.e. it happens on every node that holds data for the snapshot, and each of those nodes is responsible for putting its data into the repo.


(Animageofmine) #3

Thanks. I take your reply as "you need to back up from each node in the cluster".

When we restore the snapshot, how would each node know what to restore from the repository? For example, if an index has 5 shards, each shard on a different data node, how would each data node know what to restore from the snapshot?


(Mark Walkom) #4

Yes, all nodes.

A restore does the entire index, or the entire snapshot. You can't do it per shard.
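For reference, a restore can be scoped to whole indices (though not individual shards) via the `indices` parameter of the restore API. A minimal sketch, assuming a repository named `elasticsearch`, a snapshot named `snapshot_12162016`, and a hypothetical index name:

```shell
# Restore only a selected index from a snapshot.
# Whole-snapshot and whole-index restores are supported; per-shard is not.
# "logs-2016.12.16" is a hypothetical index name for illustration.
curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/snapshot_12162016/_restore' -d '{
    "indices": "logs-2016.12.16",
    "ignore_unavailable": "true",
    "include_global_state": false
}'
```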


(Animageofmine) #5

Sounds good, I will set this up and give it a shot, thanks for your help!

Just a quick clarification: Should I use the same snapshot name (e.g. snapshot_12162016) from all nodes or separate snapshot name from each node?


(Mark Walkom) #6

You cannot have a per node snapshot.


(Animageofmine) #7

Perfect, thank you for the clarification!


(Christian Dahlqvist) #8

Creating a snapshot is, as Mark points out, a cluster-level operation, and the generated snapshots are written to shared storage that all nodes need access to, e.g. a network-mounted file system, S3, or HDFS.
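To illustrate, here is a minimal sketch of registering a shared-filesystem repository once, from any node. The repository name and mount path are hypothetical; for the `fs` repository type, the location must also be whitelisted via `path.repo` in elasticsearch.yml on every node:

```shell
# Register a shared-filesystem snapshot repository (run once, against any node).
# Assumes /mnt/es_backups is a network mount visible to all nodes and is
# listed under path.repo in each node's elasticsearch.yml.
curl -XPUT 'http://localhost:9200/_snapshot/elasticsearch' -d '{
    "type": "fs",
    "settings": {
        "location": "/mnt/es_backups",
        "compress": true
    }
}'
```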


(Animageofmine) #9

@Christian_Dahlqvist @warkolm

I added the AWS plugin and was able to register the repository. However, when I try to take a snapshot from each of the nodes in the cluster, I get a concurrent snapshot exception.

{
  "error": {
    "root_cause": [
      {
        "type": "concurrent_snapshot_execution_exception",
        "reason": "[essnapshots:12192016]a snapshot is already running"
      }
    ],
    "type": "concurrent_snapshot_execution_exception",
    "reason": "[essnapshots:12192016]a snapshot is already running"
  },
  "status": 503
}  

Following is my query that runs from each node in the cluster:

curl -XPUT 'http://localhost:9200/_snapshot/essnapshots/12192016?wait_for_completion=false' -d '{    
    "ignore_unavailable": "true",
    "include_global_state": false
}'

This suggests one of two things: either you can't use the same snapshot name from every node (i.e. each node needs a different name), or you don't need to fire the snapshot API from all the nodes at all. I don't know which one is valid and would like to avoid trial-and-error testing. A verbose explanation with an example would be appreciated. We are trying to go live in production. Thank you.


(Christian Dahlqvist) #10

Snapshots are not created per node but for the whole cluster, which is why all nodes need access to the storage.


(Animageofmine) #11

I think all nodes have access to the storage, because the snapshot was successful. This would mean that you only need to trigger the snapshot from one of the nodes in the cluster, and it will talk with the other nodes, gather the data, and push it to S3. Let me know if that's not the case. Thanks so much!


(Christian Dahlqvist) #12

It does not matter which node you trigger the snapshot from, so that is correct.
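Putting the thread's conclusion together: trigger the snapshot once, against any single node, then poll its status. A sketch reusing the repository and snapshot names from the post above:

```shell
# Trigger ONE cluster-wide snapshot from any node (not once per node;
# a second concurrent request is what raises
# concurrent_snapshot_execution_exception).
curl -XPUT 'http://localhost:9200/_snapshot/essnapshots/12192016?wait_for_completion=false' -d '{
    "ignore_unavailable": "true",
    "include_global_state": false
}'

# Poll progress from any node; "state" moves from IN_PROGRESS to SUCCESS.
curl -XGET 'http://localhost:9200/_snapshot/essnapshots/12192016'

# List all snapshots in the repository.
curl -XGET 'http://localhost:9200/_snapshot/essnapshots/_all'
```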


(Animageofmine) #13

sounds good. Thank you so much!


(Animageofmine) #14

@Christian_Dahlqvist @warkolm

I've been monitoring the logs and observed the following, which seems to come from the AWS connection. I think the log entries are probably due to some internal protocol for connecting to AWS, but I still wanted to run them by you to check whether this is expected.

I am not sure why it is connecting to AWS so many times that it has to close the connection over and over. We did not request any snapshots during this time.

Seems to log every minute based on the pattern (6:04:11, 6:05:11, etc...)

[2016-12-20T06:04:11,002][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Closing connections idle longer than 60 SECONDS
[2016-12-20T06:05:11,002][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Closing connections idle longer than 60 SECONDS
[2016-12-20T06:06:11,002][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Closing connections idle longer than 60 SECONDS
[... same message repeated every minute through 06:15:11 ...]

(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.