Index Backups to S3?

I have a 3-node cluster on EC2. All 3 nodes run as master-eligible/data
nodes, with the default 1 replica and 5 shards.

I am wondering if backing up the index is necessary. If so, is S3 a good
place to put it?

Our entire index can be recreated from MySQL in about 12 hours. Can you
guys please point me in the right direction?

Thank you.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

On EC2, I'd say the best backup option is an EBS snapshot -- if you're
using EBS for ES persistence, that is.

The recommended, general backup/restore strategy right now is to use
tar+scp/rsync/etc. to offload the whole data directory somewhere else. That
somewhere could well be S3; you can script it with the Fog gem [1]. Maybe
you can reuse ideas or code from the Backup gem [2].
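As a minimal sketch of that offload, something like the following -- the paths and bucket name here are stand-ins, not your real layout, and the S3 step assumes a configured s3cmd (or whatever your Fog script ends up doing):

```shell
# Offload sketch with stand-in paths (your real data directory is likely
# somewhere like /var/lib/elasticsearch; the bucket name is made up).
DATA_DIR=/tmp/es-data
STAMP=20130221                 # e.g. $(date +%Y%m%d) in a real script
ARCHIVE=/tmp/es-backup-$STAMP.tar.gz

# Stand-in data so the sketch is self-contained.
mkdir -p "$DATA_DIR"
echo demo > "$DATA_DIR/segments.gen"

# 1. Archive the whole data directory, preserving its top-level name.
tar czf "$ARCHIVE" -C "$(dirname "$DATA_DIR")" "$(basename "$DATA_DIR")"

# 2. Ship the archive to S3 (needs a configured s3cmd; shown, not run):
#    s3cmd put "$ARCHIVE" s3://my-es-backups/$STAMP.tar.gz
```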

Karel

On Thursday, February 21, 2013 3:39:37 AM UTC+1, Bruno Miranda wrote:


Maybe you can reuse ideas or code from the Backup gem [2].

Specifically the "Amazon S3" section on the syncers page:
https://github.com/meskyanichi/backup/wiki/Syncers looks intriguing.

Karel


Any reason why I should not use S3 Gateway?

http://www.elasticsearch.org/guide/reference/modules/gateway/s3.html

On Thursday, February 21, 2013 12:22:40 AM UTC-8, Karel Minařík wrote:


Any reason why I should not use S3 Gateway?

Yes: it's deprecated and will be removed.

Karel


S3 gateway has been deprecated.

On Thu, Feb 21, 2013 at 10:11 AM, Bruno Miranda bru.miranda@gmail.com wrote:


On Wed, Feb 20, 2013 at 7:39 PM, Bruno Miranda bru.miranda@gmail.com wrote:

I am wondering if backing up the index is necessary. If so, is S3 a good
place to put it?

Here's what I would recommend. It's based on what we do for backups at
http://bonsai.io/ and is an alternative to EBS snapshots, which are a
pretty reasonable approach if you're serving your data from a single EBS
volume. (I think there are arguments for not using EBS; another topic.)

First, cp -lr your Elasticsearch data directory for a quick, cheap
filesystem snapshot. This is one of those arcane bits of Unix knowledge
that I learned once and understand intuitively, but would probably do a bad
job of explaining in detail, so consult the cp man page. Effectively, you
get a cheap, instant copy and only pay the disk space for the delta as your
original changes.
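For the curious, here's the trick in miniature, with throwaway /tmp paths standing in for a real data directory. cp -l creates hard links instead of copying file contents, so the "copy" is nearly instant and consumes no extra space until the original files are rewritten:

```shell
# Hard-link snapshot demo (hypothetical paths; substitute your real
# Elasticsearch data directory).
rm -rf /tmp/es-demo
mkdir -p /tmp/es-demo/data
echo "segment data" > /tmp/es-demo/data/_0.cfs

# New directories, but every file inside is a hard link to the original.
cp -lr /tmp/es-demo/data /tmp/es-demo/snapshot

# Both paths now point at the same inode, so the link count is 2.
stat -c %h /tmp/es-demo/data/_0.cfs
```

This works because Lucene segment files are immutable once written; Elasticsearch replaces files rather than editing them in place, so the hard-linked snapshot stays consistent.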

Incidentally, cp -lr is cheap and useful enough that we use it to
snapshot our data on every deploy, just in case.

If your data is on EBS, I would first rsync -a that snapshot over to
the ephemeral store. This presumes you have enough space on your ephemeral
store, which is a good constraint to consider when designing your cluster.

An up-to-date ephemeral copy gives you some fairly cheap insurance when
(not if) your EBS volume gets stuck. You can just change your data
directory and restart the cluster. It should also save you some iops
against your production EBS volume traffic while you're running your backup
to S3.

From the data snapshot, or rsync'd copy in your ephemeral store, you can
use something akin to s3sync to send your data over to S3. We wrote a
custom implementation; the backup gem that Karel linked looks reasonable
too. We're also syncing into a "rolling window" of S3 buckets per daily
backup, with a directory per host, since our main story for full backups is
recovering from a customer's own accidental deletion.
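The rolling-window idea can be sketched with a naming scheme like this -- the bucket scheme and host name are illustrative, not the actual Bonsai implementation:

```shell
# Rolling-window naming sketch: one bucket per weekday plus a per-host
# prefix gives seven days of restorable history, with each weekday's
# backup overwriting last week's copy.
HOST=node1                      # e.g. $(hostname) in a real script
DAY=$(date +%u)                 # day of week, 1-7
DEST="s3://my-es-backups-day$DAY/$HOST/"
echo "$DEST"

# A sync tool such as s3sync or s3cmd would then mirror the snapshot:
#   s3cmd sync /mnt/es-snapshot/ "$DEST"
```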

When considering availability in AWS, I have these priorities:

  1. Replicate every index in at least two Availability Zones. You should be
    able to completely toss an entire AZ worth of instances, and their data,
    without causing an outage. This is where you earn your sleep at night when
    you're hosting on AWS.

  2. Recovery from a botched deploy. Maybe a major version upgrade goes
    sideways (we were bitten hard during our beta when I messed up our 0.18 to
    0.19 upgrade). A cheap snapshot lets you roll back your cluster state and
    data with a couple of filesystem operations.

  3. Recovery from a total cluster loss by syncing back from S3. In practice,
    we use our S3 backups more often to help a customer save time recovering
    from an accidental deletion. And if we ever have an entire cluster outage,
    asking all our customers to reindex is a big no-no for us.

When it's just your data, and you know how long it takes to reindex from
scratch, the cost-benefit analysis on reindexing versus restoring is up to
you. A restore from S3 probably needs to save you more than a couple of
hours to be worth the effort.

Then again, this is all the kind of thing you should probably just do
anyway and assume your future self will appreciate. Or, you know, use
Bonsai :wink:

--
Nick Zadrozny

Cofounder, One More Cloud

websolr.com (https://websolr.com/home) • bonsai.io (http://bonsai.io/home)

Hassle-free hosted full-text search,
powered by Apache Solr and ElasticSearch.


Excellent answer. Thank you.

On Thursday, February 21, 2013 12:01:32 PM UTC-8, Nick Zadrozny wrote:
