How to dump the entire contents of ES?


(Daniel Maher) #1

Hello,

I am curious to know what - if any - is the best-practice method for
"dumping" the entire contents of Elasticsearch, say, for the purposes of
doing a snapshot backup.

Thank you.

--
Daniel Maher
« can't talk, too busy calculating computrons. »


(Shay Banon) #2

If you are using a shared gateway, then this is done automatically against shared storage, which you can then back up by just copying it over. This includes only the primary storage (it does not include replicas).

The local gateway (the default and recommended way to use elasticsearch) stores the data on the local drive. You can back up those files, but they will also include replica data.

In terms of a data dump of an index, the upcoming search_type scan will let you iterate over a large result set.
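
A minimal sketch of iterating over a large result set page by page, as described above. It assumes the response shape of the Elasticsearch scroll API (a `_scroll_id` plus a `hits.hits` list), and it assumes a scan-style search may return no hits in its first response; check both against the version you run. The two callables, `start_search` and `next_page`, are hypothetical stand-ins for whatever HTTP client you use.

```python
from typing import Any, Callable, Dict, Iterator

Response = Dict[str, Any]


def iterate_all_hits(
    start_search: Callable[[], Response],
    next_page: Callable[[str], Response],
) -> Iterator[Dict[str, Any]]:
    """Yield every hit from a scrolled search, one page at a time.

    start_search() issues the initial search request (hypothetical helper).
    next_page(scroll_id) fetches the following page (hypothetical helper).
    """
    response = start_search()
    # The first response may be empty (assumed for scan-style searches),
    # so an empty first page does not end the iteration.
    yield from response["hits"]["hits"]
    scroll_id = response["_scroll_id"]

    while True:
        response = next_page(scroll_id)
        hits = response["hits"]["hits"]
        if not hits:
            break  # an empty page after the first request means we are done
        yield from hits
        # Use the newest scroll id if the server returned one.
        scroll_id = response.get("_scroll_id", scroll_id)
```

Because the function is a generator, the caller can write each hit to a file as it arrives instead of holding the whole index in memory.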

In reply to Daniel Maher's message of Thursday, March 31, 2011 at 1:35 PM.


(Daniel Maher) #3

On Thu, 2011-03-31 at 17:57 +0200, Shay Banon wrote:

If you are using a shared gateway, then this is done automatically
against shared storage, which you can then back up by just copying it
over. This includes only the primary storage (it does not include
replicas).

The local gateway (the default and recommended way to use
elasticsearch) stores the data on the local drive. You can back up
those files, but they will also include replica data.

In terms of a data dump of an index, the upcoming search_type scan will
let you iterate over a large result set.

Thank you for your rapid and informative reply!

As a follow-up question, I would love to hear your thoughts on what the
best way to maintain distinct development / production clusters would
be. Basically, we have a development cluster that we'd like to
synchronise with the production cluster every once in a while (in order
to let our programmers work with close-to-real data). We've gone
through a few different approaches to this, but we're not sure what the
safest / most efficient process would be; if you or anyone else on the
list has any insight, it would be greatly appreciated.

Thanks again.

--
Daniel Maher
« can't talk, too busy calculating computrons. »


(dbenson) #4

As a follow-up question, I would love to hear your thoughts on what the
best way to maintain distinct development / production clusters would

We have 5 clusters: development, acceptance and three production
clusters, each in a different data center (We've been live since Nov
1).

Each cluster has a name defined in the elasticsearch.yml file; we also
include the ES version in it. This way a machine can't join the wrong
cluster.

cluster:
  name: dev-0.15
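
A small sketch of generating that setting per environment, so the name always carries the ES version. The helper names `cluster_name` and `render_cluster_config` are made up for illustration; only the `cluster: name:` setting and the `dev-0.15` value come from the post.

```python
def cluster_name(environment: str, es_version: str) -> str:
    """Build a cluster name such as 'dev-0.15' from environment and ES version."""
    return f"{environment}-{es_version}"


def render_cluster_config(environment: str, es_version: str) -> str:
    """Render the elasticsearch.yml fragment that sets the cluster name."""
    return f"cluster:\n  name: {cluster_name(environment, es_version)}\n"


if __name__ == "__main__":
    print(render_cluster_config("dev", "0.15"))
```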

Each cluster has our set of applications which index and query data.
The applications are in Java and built with the corresponding ES
versions. When upgrading ES, we usually reindex data. We pull a
machine from the cluster, give it a new name (with the new version
number), and build up content. Then, during a planned maintenance
window in which we've removed client traffic from that data center, we
upgrade the other machines in the cluster and have them pull content
off the machine with the rebuilt content. We're still in a situation
where all index content fits on a single machine.

The content to be indexed is processed in each environment; we just
keep less of it (i.e., dev has 90 days, acc 180 days), so each
environment has the same data to work with (and test against).

David
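
The per-environment retention described above could be sketched as a simple date filter. The 90-day and 180-day windows come from the post; the production window is not stated, so this sketch assumes that an environment without an entry keeps everything. The names `RETENTION_DAYS` and `within_retention` are hypothetical.

```python
from datetime import date, timedelta

# Retention windows from the post; production is not stated there.
RETENTION_DAYS = {"dev": 90, "acc": 180}


def within_retention(doc_date: date, environment: str, today: date) -> bool:
    """Return True if a document of this date should be kept in the environment."""
    days = RETENTION_DAYS.get(environment)
    if days is None:
        # Assumption: environments without a configured window keep all content.
        return True
    return doc_date >= today - timedelta(days=days)
```

For example, a document that is 100 days old would be kept in acceptance but skipped in development.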


(Shay Banon) #5

Hey,

What was suggested here is great. Another approach is to simply make a copy of the data directory of each production cluster and move it to development. Just make sure that the copy itself does not create too big an IO load on the production system.

-shay.banon
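
One way to keep such a copy from loading the production disks is to cap its throughput. The sketch below copies a single file in chunks and sleeps whenever the average rate exceeds a limit. The function name, the chunk size, and the rate limit are all assumptions for illustration; nothing here is specific to Elasticsearch.

```python
import time
from pathlib import Path


def throttled_copy(
    src: Path,
    dst: Path,
    max_bytes_per_sec: int,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy src to dst, keeping the average rate at or below max_bytes_per_sec.

    Returns the number of bytes copied.
    """
    if max_bytes_per_sec <= 0:
        raise ValueError("max_bytes_per_sec must be positive")

    copied = 0
    started = time.monotonic()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            # Time the copy should have taken so far at the capped rate.
            min_elapsed = copied / max_bytes_per_sec
            delay = min_elapsed - (time.monotonic() - started)
            if delay > 0:
                time.sleep(delay)
    return copied
```

To copy a whole data directory, the same function can be applied to each file found by walking the directory tree.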

In reply to dbenson's message of Friday, April 1, 2011 at 8:35 PM.


(dchancogne) #6

This is very interesting. Thanks for sharing this insight.

How do you keep the data on the new node in sync with the data added to the old cluster up to the switch? Do you index new documents in both places? If so, what do you use to do that?

Thank you.

