Index Backups to S3?

Bruno_Miranda · February 21, 2013, 2:39am

I have a 3-node cluster on EC2. All 3 nodes run as master-eligible/data
nodes. Default 1 replica and 5 shards.

I am wondering if backing up the index is necessary. If so, is S3 a good
place to put it?

Our entire index can be recreated from MySQL in about 12 hours. Can you
guys please point me in the right direction?

Thank you.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

karmi · February 21, 2013, 8:19am

On EC2, I'd say the best backup option is an EBS snapshot -- if you're
using EBS for ES persistence, that is.

The recommended, general backup/restore strategy right now is to use
tar+scp/rsync/etc to offload the whole data directory somewhere else. That
somewhere could well be S3; you can script it with the Fog gem [1]. Maybe
you can reuse ideas or code from the Backup gem [2].
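
As a rough illustration of the "tar the whole data directory" step, here is a minimal Python sketch. The function name `archive_data_dir` and the timestamped file naming are assumptions made for this example, not anything from Elasticsearch or the gems mentioned above. Shipping the resulting archive to S3 (or anywhere else) is left to whatever tool you prefer.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def archive_data_dir(data_dir: str, dest_dir: str) -> Path:
    """Pack data_dir into a timestamped .tar.gz file inside dest_dir.

    Hypothetical helper for illustration only.
    """
    src = Path(data_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # UTC timestamp so archives from different hosts sort consistently.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = dest / f"{src.name}-{stamp}.tar.gz"

    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the
        # directory name instead of the absolute path on this host.
        tar.add(str(src), arcname=src.name)
    return archive
```

Calling `archive_data_dir("/path/to/data", "/path/to/backups")` returns the path of the new archive, which can then be handed to scp, rsync, or an S3 client.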

Karel


karmi · February 21, 2013, 8:22am

Maybe you can reuse ideas or code from the Backup gem [2].

Specifically the "Amazon S3" section on the syncers page,
https://github.com/meskyanichi/backup/wiki/Syncers, looks intriguing.

Karel


Bruno_Miranda · February 21, 2013, 6:11pm

Any reason why I should not use S3 Gateway?


karmi · February 21, 2013, 6:25pm

Any reason why I should not use S3 Gateway?

Yes: it's deprecated and will be removed.

Karel


Ivan · February 21, 2013, 6:25pm

S3 gateway has been deprecated:

github.com/elastic/elasticsearch issue "Deprecate Shared Gateway", by
kimchy (opened 10:43AM, closed 10:44AM, 03 Dec 12 UTC; labels: >breaking,
v0.20.0, v0.90.0.Beta1, v0.19.12):

Shared gateways (shared FS storage or S3, for example) are problematic performance-wise, since they constantly need to snapshot the state of the index to a shared location and then use that as the system of record. The local gateway, on the other hand, doesn't need to do that and performs much better. The main benefit of a shared gateway is that the data is actually stored in another persistent location (e.g. using ephemeral disks on AWS, but still having the data on S3), but that is actually abusing the shared gateway design (using it as a backup). In the near future, we will have a proper snapshot (backup)/restore API, which will be the proper way to do backups; relying on the shared gateway for that is problematic. Note that backups can still be made by "manually" rsyncing the data location of each node.


Nick_Zadrozny · February 21, 2013, 8:01pm

On Wed, Feb 20, 2013 at 7:39 PM, Bruno Miranda <bru.miranda@gmail.com> wrote:

I am wondering if backing up the index is necessary. If so, is S3 a good
place to put it?

Here's what I would recommend. It's based on what we do for backups at
http://bonsai.io/ and is an alternative to EBS snapshots, which are a
pretty reasonable approach if you're serving your data from a single EBS
volume. (I think there are arguments for not using EBS; another topic.)

First, cp -lr your Elasticsearch data directory for a quick, cheap
filesystem snapshot. The -l flag makes cp create hard links instead of
copying file contents. This is one of those arcane bits of Unix knowledge
that I learned once and understand intuitively, but would probably do a bad
job explaining in detail, so consult the cp man page. Effectively, you get
a cheap, instant copy and only pay the disk space for the delta as your
original changes.

Incidentally, cp -lr is cheap and useful enough that we use it to
snapshot our data on every deploy, just in case.
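
For anyone who prefers to script this rather than shell out, the same hard-link idea can be sketched in Python. This is only an illustration of the technique, not what Bonsai runs; `hardlink_snapshot` is a made-up name. It relies on `shutil.copytree` accepting a `copy_function`, here `os.link`, so files are linked rather than copied. Two things to keep in mind: the snapshot has to live on the same filesystem as the data directory, and because both names point at the same file, anything that gets modified in place (rather than replaced by a new file) changes in the snapshot too.

```python
import os
import shutil


def hardlink_snapshot(data_dir: str, snapshot_dir: str) -> str:
    """Mirror data_dir's tree under snapshot_dir, hard-linking every file.

    Hypothetical helper, roughly equivalent to `cp -lr data_dir snapshot_dir`.
    snapshot_dir must not exist yet and must be on the same filesystem.
    """
    # Directories are created fresh; each file is linked via os.link,
    # so no file contents are copied and the call is fast.
    return shutil.copytree(data_dir, snapshot_dir, copy_function=os.link)
```

Deleting or replacing a file in the original directory afterwards leaves the snapshot's copy intact, which is what makes this usable as a rollback point.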

If your data is on EBS, I would first rsync -a that snapshot over to
the ephemeral store. This presumes you have enough space on your ephemeral
store, which is a good constraint to consider when designing your cluster.

An up-to-date ephemeral copy gives you some fairly cheap insurance when
(not if) your EBS volume gets stuck. You can just change your data
directory and restart the cluster. It should also save you some IOPS
against your production EBS volume while you're running your backup to S3.

From the data snapshot, or rsync'd copy in your ephemeral store, you can
use something akin to s3sync to send your data over to S3. We wrote a
custom implementation; the backup gem that Karel linked looks reasonable
too. We're also syncing into a "rolling window" of S3 buckets per daily
backup, with a directory per host, since our main story for full backups is
recovering from a customer's own accidental deletion.
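
One possible reading of the "rolling window" layout, sketched in Python. Everything here is an assumption for illustration: the `backup_target` name, the `es-backup` bucket prefix, the 7-day default, and the choice to reuse bucket slots by taking the day number modulo the window size. The actual Bonsai implementation is not shown in this thread.

```python
from datetime import date
from typing import Tuple


def backup_target(day: date, host: str, window_days: int = 7,
                  bucket_prefix: str = "es-backup") -> Tuple[str, str]:
    """Return a (bucket, key_prefix) pair for one host's daily backup.

    Hypothetical layout: one bucket per slot in the window, reused every
    window_days days, with one top-level directory per host.
    """
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    # Days that are exactly window_days apart land in the same slot,
    # so the newest backup overwrites the oldest one.
    slot = day.toordinal() % window_days
    return f"{bucket_prefix}-{slot}", f"{host}/"
```

With the default window, any seven consecutive days map to seven different buckets, and day eight reuses the bucket from day one.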

When considering availability in AWS, I have these priorities:

1. Replicate every index in at least two Availability Zones. You should be
   able to completely toss an entire AZ worth of instances, and their data,
   without causing an outage. This is where you earn your sleep at night
   when you're hosting on AWS.

2. Recovery from a botched deploy. Maybe a major version upgrade goes
   sideways (we were bit hard during our beta when I messed up our 0.18 to
   0.19 upgrade). A cheap snapshot lets you roll back your cluster state
   and data with a couple filesystem operations.

3. Recovery from a total cluster loss by syncing back from S3. In practice,
   we use our S3 backups more often to help a customer save time recovering
   from an accidental deletion. And if we ever have an entire cluster
   outage, asking all our customers to reindex is a big no-no for us.

When it's just your data, and you know how long it takes to reindex from
scratch, the cost-benefit analysis on reindexing versus restoring is up to
you. You probably want a restore from S3 to save you a couple hours in
order to be worth the effort.

Then again, this is all the kind of thing you should probably just do
anyway and assume your future self will appreciate. Or, you know, use
Bonsai

--
Nick Zadrozny

Cofounder, One More Cloud

websolr.com https://websolr.com/home • bonsai.io http://bonsai.io/home

Hassle-free hosted full-text search,
powered by Apache Solr and Elasticsearch.


Bruno_Miranda · February 21, 2013, 8:27pm

Excellent answer. Thank you.

