Taking a snapshot to a local repository on multiple nodes without a shared FS


(Ashish Goel) #1

Hi there,

My cluster nodes run on EC2 instances. When I try to take a snapshot to a local FS repository, I get an error saying that the target location for the snapshot is not shared.
I then switched to an S3 repository for snapshots. It works well; the only problem is that while a snapshot is being taken, the process blocks any updates to ES. I read that ES needs to take the details of previous snapshots into account (figuring out which segments are involved), and that this is why creating a new one takes some time. I already have a cron job that deletes snapshots older than a week, but even so it takes some time to create new snapshots.
So I was wondering whether there is a way to move off S3 and use a local repository instead. Please note that these nodes do not have any shared mount space. AWS recently launched EFS, but it is not available in the region where my ES cluster runs.

Thanks


(Mark Walkom) #2

Can you elaborate on this? What do you mean?

No, you can't. It has to be shared.
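
For reference, here is a minimal sketch of what registering the two repository types looks like through the snapshot REST API. The repository names, the mount path and the bucket name are placeholders I made up, and the code only builds the requests without sending them, so adapt it to your own client and setup.

```python
import json


def repo_request(name, repo_type, settings):
    """Build (method, path, body) for registering a snapshot repository."""
    body = json.dumps({"type": repo_type, "settings": settings}, sort_keys=True)
    return "PUT", f"/_snapshot/{name}", body


# Shared filesystem repository: "location" must point to the same shared mount
# on every master and data node (and be allowed via path.repo).
# The path below is a placeholder.
fs_req = repo_request("my_fs_backup", "fs", {"location": "/mnt/shared/es_backups"})

# S3 repository (requires the S3 repository plugin). The bucket is a placeholder.
s3_req = repo_request("my_s3_backup", "s3", {"bucket": "my-es-snapshots"})

for method, path, body in (fs_req, s3_req):
    print(method, path, body)
```

The point of the `fs` variant is that every node writes its own shards to `location`, which is why a node-local directory cannot work.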


(Ashish Goel) #3

Thanks Mark. Since it has to be shared, maybe I will have to wait for AWS to provide EFS support in my region.

Regarding the updates being blocked:
While creating a snapshot, ES needs to capture the data as it was at the point in time the process started, so if the data keeps changing, this gets complicated. That gave us the idea that ES might be freezing all updates during that decision phase. When we monitored the network traffic in and out of the ES nodes, we noticed that the average latency of calls shot up by ~15 sec every time a snapshot was taken, without fail, which matches the duration reported in the curator logs.
Also note that an S3 repository is in play here, so other factors such as network latency, network wait, and the amount of data downloaded/uploaded during the process may be what makes the decision phase last 15 sec or so.
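
A rough sketch of the comparison described above: split the request latencies into those that fall inside a snapshot window and those outside it, then compare the averages. The sample values and the window boundaries are made up for illustration and are not measurements from our cluster.

```python
from statistics import mean


def split_by_window(samples, start, end):
    """Split (timestamp_sec, latency_sec) samples into latencies inside and outside [start, end]."""
    inside = [lat for ts, lat in samples if start <= ts <= end]
    outside = [lat for ts, lat in samples if not (start <= ts <= end)]
    return inside, outside


# Hypothetical request samples: (timestamp in seconds, observed latency in seconds).
samples = [(0, 0.2), (10, 0.3), (20, 15.1), (25, 14.9), (40, 0.2), (50, 0.3)]

# Hypothetical snapshot start/end, as they would be read from the curator logs.
snapshot_start, snapshot_end = 18, 33

inside, outside = split_by_window(samples, snapshot_start, snapshot_end)
print("avg latency during snapshot:", mean(inside))
print("avg latency otherwise:", mean(outside))
```

If the average inside the window is consistently much higher than outside it, that supports the correlation with the snapshot run, though it does not by itself show whether ES or the S3 round trips are responsible.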

