Elastic Search Backup and Recovery

ElasticFan · July 16, 2012, 4:35pm

Hi all,

We recently set up a big cluster where every day we index around 50 million
records, totaling over 40 GB. We use 3 big machines with 128 GB RAM,
dual hex-core processors and 5 TB disk (only 7200 rpm, but RAID 5). We create
one index per day, as recommended in this discussion group.

We kept the settings as shard=2 and replica=2. This is where the question
begins. We would like to have an incremental backup process so that if
there is any data corruption, an accidental delete, or even if the whole
cluster goes down, we can recover the data in full.
Initially my plan was to do the following:

1. Do FS Gateway to a remote system. Say a separate machine which will
   have 5 TB storage with RAID 5 and 15,000 rpm. (Is this doable?)
2. Do regular tape backup on this backup machine.
3. Restore the data to the Gateway.

By doing this, if we find a problem, we will be able to roll back to a
certain point in time. My assumption here is that the FS Gateway will
persist both the state and the data of the cluster. Is this assumption
correct, and is this approach recommended? If not, what are the other ways
to do a full data recovery? I have replica:2, so the data will be present on
2 nodes at all times, and so I cannot go with a single-machine data backup.
Please help me with this.

Regards,
KS

Berkay_Mollamustafao · July 17, 2012, 1:43am

A quick correction. If you have set replica to 2, you have 2 replica copies
plus the master (primary) copy, hence a total of 3 copies. So you have the
same data on all nodes. If you want to have just one replica copy, set
replica to 1.
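
The copy count can be checked with a little arithmetic. This is only a sketch: the function names are made up for illustration, and it assumes the on-disk index size is roughly the 40 GB per day quoted above, which may not hold in practice.

```python
def shard_copies(primary_shards: int, replicas: int) -> int:
    """Total shard copies in the cluster: every primary plus its replicas."""
    return primary_shards * (1 + replicas)


def total_footprint_gb(index_size_gb: float, replicas: int) -> float:
    """Disk used across the whole cluster for one index of the given size."""
    return index_size_gb * (1 + replicas)


# Settings from the question: shard=2, replica=2, about 40 GB per daily index.
print(shard_copies(2, 2))          # 6 shard copies spread over the 3 nodes
print(total_footprint_gb(40, 2))   # 120 GB of disk per daily index

# With replica=1 the same index has 4 shard copies and uses about 80 GB.
print(shard_copies(2, 1), total_footprint_gb(40, 1))
```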

Berkay

--
Regards,
Berkay Mollamustafaoglu
Ph: +1 (571) 766-6292
mberkay on yahoo, google and skype

Paul_Smith · July 17, 2012, 3:27am

We use FS Gateway to an NFS mount on a different machine. Every night we
suspend the Gateway Snapshot via REST API, hard link copy the gateway
directory on this different machine temporarily, and then resume the
snapshotting. We then rsync this hardlink copy of the gateway back to a DR
location on the other side of the planet. So we now have a daily snapshot
ready to load in a different Data Center.
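
The hard-link copy step can be sketched roughly as follows. This is not the actual script used here, just a minimal illustration in Python: `hardlink_copy` is a hypothetical helper, the destination is assumed to be a fresh directory on the same filesystem as the gateway (hard links cannot cross filesystems), and the suspend/resume REST calls and the rsync to the DR site are deliberately left out because their exact form is not given in this thread.

```python
import os


def hardlink_copy(src_root: str, dst_root: str) -> int:
    """Mirror src_root into dst_root using hard links instead of file copies.

    Returns the number of files linked. dst_root must be on the same
    filesystem as src_root and must not already contain the files.
    """
    linked = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            # A hard link shares the data blocks, so this is fast and uses
            # almost no extra disk space.
            os.link(os.path.join(dirpath, name), os.path.join(target_dir, name))
            linked += 1
    return linked
```

Because the segment files are not modified in place once written, the linked tree stays a stable point-in-time view while snapshotting resumes on the original directory, and it is that stable tree that gets rsynced.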

We then use Scrutineer (we wrote this):
https://github.com/Aconex/scrutineer to roll forward/backward and sync its
state with the copy of the database that is using txn log shipping back to
this same DR Data Center.

You can use Scrutineer on your live system to check for integrity errors
too; it helps with detecting data mismatches, missing items, etc. Beats a
full reindex, that's for sure.

cheers,

Paul

ElasticFan · July 17, 2012, 9:05am

Hi Berkay,

You are correct. The replica was set to 1. Master and one replica, but 3
nodes in total.

Hi Paul,

Thank you so much. I was doing an rsync on the ES data directory as a
temporary backup solution. Could you please guide me in setting up the FS
Gateway correctly? I tried something and I could not restore the data.

1. Could you please show me all the settings that you made, and in which
   config files?
2. Is incremental index possible on the FS Gateway snapshot?
3. Am I correct in understanding that the FS Gateway snapshot will have
   both the state and the data?
4. How do I restore the data back into the cluster? Is it a matter of
   restoring the snapshot and then restarting the cluster?
5. Are you using it in production? If yes, do you know how long it would
   take to restore a cluster with 900 GB of data in total?

Sorry for firing away all these questions. Any help would be much
appreciated.

Thank you. This thread would help a lot of poor souls trying to find an
optimum disaster recovery solution.

Regards,
KS

Paul_Smith · July 18, 2012, 12:58am

Could you please show me all the settings that you made, and in which
config files?

The gateway setting is very simple:

# Gateway is shared FS
gateway:
  type: fs
  fs:
    location: /mnt/esgateway/

/mnt/esgateway is an NFS-mounted share on a physical host separate from all
the ES nodes. All ES nodes mount this same share.

Is incremental index possible on the FS Gateway snapshot?

No, with a bit of yes, but it's not what you think it is. At the end of
the day the segments are files, and the incremental 'sync' difference is
relatively low: the smaller segments generally merge together, more often
than not leaving the bigger segments unchanged until a larger merge
happens. So the rsync tends to have good 'savings' in terms of not needing
to sync too many large files, until a very large merge or an Optimize is
done, and then it's sort of all brand new files again.

So it's mostly no, it's always a 'full' sync, but there's lots of savings
there. Maybe you're asking a different question though.
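
To make the 'savings' concrete, here is a small sketch that compares two gateway listings and works out what would need to be sent. Everything in it is illustrative: the segment file names and sizes are invented, and comparing by name and size is a simplification of rsync's default quick check, which also looks at modification time.

```python
from typing import Dict, List


def files_to_transfer(previous: Dict[str, int], current: Dict[str, int]) -> List[str]:
    """Files in the current listing that are new or differ in size."""
    return sorted(name for name, size in current.items() if previous.get(name) != size)


def transfer_bytes(previous: Dict[str, int], current: Dict[str, int]) -> int:
    """Total size of the files that would have to be sent."""
    return sum(current[name] for name in files_to_transfer(previous, current))


# Hypothetical listings (file name -> size in bytes). Overnight, two small
# segments were merged into a new one; the big segment was left untouched.
previous = {"_a.cfs": 100_000, "_b.cfs": 500, "_c.cfs": 700}
current = {"_a.cfs": 100_000, "_d.cfs": 1_200}

print(files_to_transfer(previous, current))   # ['_d.cfs']
print(transfer_bytes(previous, current), "of", sum(current.values()), "bytes")
```

After a very large merge or an Optimize, the big file would show up under a new name as well, and the transfer would be close to the full size again.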

Am I correct in understanding that the FS Gateway snapshot will have
both the state and the data?

Yes, the gateway includes the cluster state (metadata) and the indices
directories.

How do I restore the data back into the cluster? Is it a matter of
restoring the snapshot and then restarting the cluster?

While there's no 'restore' tool (as yet), all we do in our DR is:

1. Have the DR cluster shut down.
2. Wipe clean the local data directory for each node.
3. Ensure the DR cluster has a configuration with the FS Gateway pointed to
   an NFS share with the replicated copy of the gateway.
4. Start up the cluster.

All the nodes now recover their state from the gateway. (We have multiple
DCs using this DR data centre as a location, so we deliberately purge any
local node state to ensure we get a clean recovery for the DC coming into
the DR location.)
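
The wipe and the gateway check could be scripted along these lines. This is a sketch only, not our actual tooling: both function names are hypothetical, the paths are whatever your nodes use, and stopping and starting the cluster is left out because it depends on how ES is installed.

```python
import os
import shutil


def wipe_data_dir(data_dir: str) -> None:
    """Remove everything inside a node's local data directory, keeping the
    directory itself. Only run this while the node is shut down."""
    for entry in os.listdir(data_dir):
        path = os.path.join(data_dir, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)


def gateway_has_content(gateway_location: str) -> bool:
    """Cheap sanity check before start-up: the gateway share is mounted and
    is not an empty directory."""
    return os.path.isdir(gateway_location) and bool(os.listdir(gateway_location))
```

Checking the gateway before wiping is the safer order: if the NFS share is not mounted, the nodes would otherwise come up with nothing to recover from.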

Are you using it in production? If yes, do you know how long it would
take to restore a cluster with 900 GB of data in total?

Yes we do. 900 GB is a good size for sure. Ours are only around the
up-to-100-GB mark. You'll have to do some of your own testing on that;
how long it takes will be hardware/environment specific (disk RAID
setup, network bandwidth, number of nodes for parallel recovery, etc.). At
the end of the day it's how fast you can transfer the shard contents to the
relevant nodes.

I'm guessing here actually (Shay or others could confirm?), but I believe
the master 'delegates' to each node the recovery of specific shards from
the shared gateway. The central location will therefore be hit by all nodes
at once, so that host is probably the limiting resource (disk & network
bandwidth on that host).
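
For a rough feel of the numbers, here is a back-of-the-envelope estimate. It assumes the gateway host's network link is the only bottleneck and that the link speed (1 Gbit/s below) is just an example value; disk throughput, NFS overhead and replica allocation are all ignored, so treat the result as a lower bound rather than a prediction.

```python
def estimated_recovery_hours(data_gb: float, link_gbit_per_s: float) -> float:
    """Lower bound on the time to pull data_gb off the gateway host over a
    link of the given speed (decimal units: 1 GB = 8 Gbit)."""
    seconds = data_gb * 8 / link_gbit_per_s
    return seconds / 3600


# 900 GB over an assumed 1 Gbit/s link out of the gateway host.
print(estimated_recovery_hours(900, 1.0))   # 2.0

# The same data over an assumed 10 Gbit/s link.
print(estimated_recovery_hours(900, 10.0))  # 0.2
```

Adding more ES nodes does not improve this figure if they all read from the same gateway host, which is why your own testing on your hardware is the only reliable answer.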

Paul Smith
