Problems manually copying a set of indices from a 3-node cluster to a 1-node cluster

Previous posts have suggested that it's possible to back up/migrate/
restore indices from one cluster to another via a manual copy. I tested
this yesterday and ran into some problems.

Setup:
My test/reference cluster ("infinite-aws") consists of 3 nodes (on
AWS, default gateway settings). It has 8 indices: 3 with 1 shard and
2 replicas, and 5 with either 5 shards (3 indices) or 10 shards (2
indices), each with 1 replica.
I wanted to test copying this set of indices into a different cluster
("infinite-dev") consisting of just 1 node.
I am running 0.16.2

Steps:
I created a tar of the "/data/infinite-aws" directory (with
"disable translog flush" set to true) on one of the 3 "infinite-aws"
nodes, and scp'd it across to "infinite-dev".
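
(For reference, the "disable translog flush" step was a settings update
along these lines - I'm quoting the 0.16.x setting name from memory, so
treat it as an assumption and double-check it:

curl -XPUT 'http://localhost:9200/_all/_settings' -d '
  { "index": { "translog.disable_flush": true } }'

and the same call with "false" re-enables flushing afterwards.)
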
I then ran the following script (with ES stopped):

tar xvf index_backup_most_recent.tar
rm -rf data/infinite-dev
mv data/infinite-aws data/infinite-dev

And restarted ES. At this point I obviously expect the status to be
red since I have too many replicas, so I run:
curl -XPUT 'http://localhost:9200/_all/_settings' -d '
  { "index": { "number_of_replicas": 0 } }'

(I also re-enabled the translog flushing and deleted the node.lock,
though I assume that they are reset by restarting ES anyway)

This gets all but 2 indices working (one 5-shard and one 10-shard). The
5-shard index has 2 shards left unassigned, and the 10-shard index has
all of its shards unassigned. The status obviously remains red.

Looking at the status for those 2 indices (e.g. from the overview page
of ES-head), I note that the unassigned shards are not listed (e.g. the
status for the 10-shard index just reports "null"). In the overall
cluster health, the missing shards are listed as unassigned, with all
relocating fields set to null.
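
For anyone reproducing this, the same per-shard view should be available
from the cluster health API - I believe a shard-level detail parameter
like the following works on 0.16.x, though treat the exact parameter
names as an assumption:

curl 'http://localhost:9200/_cluster/health?level=shards&pretty=true'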

Looking at the distribution of the failing indices in the original
cluster, I note that 1 of the 3 nodes (in fact the master) has a copy
of every shard of every index apart from the 12 failed shards across
the 2 indices (those are distributed across the other 2 nodes - as an
aside, is that slightly dubious balancing expected?). So this would
appear to be the root of the problem.

I tried opening and closing the indices and restarting the node and
just waiting a long time, all to no avail.

So:

  • Is what I'm trying to do (I guess specifically the bit about
    copying by hand from a 3-node cluster to a 1-node cluster) supported?
  • If so, are any migration steps missing/wrong?
  • If this flavor of manual migration is not supported between
    clusters of different sizes, is there an alternative?

*** Some more details:

The INFO log reports nothing interesting, e.g.:

2011-06-14 14:57:18.104 [INFO] gateway:79 - [Spiral] recovered [8]
indices into cluster_state
2011-06-14 15:04:15.552 [INFO] cluster.metadata:79 - [Spiral] Updating
number_of_replicas to [0] for indices [doc_4dd53fb4e40d93afb096c484,
event_index, doc_4c927585d591d31d7c37097b, document_index, doc_dummy,
gazetteer_index, doc_4db5c05fb246d25364aceca0,
doc_4c927585d591d31d7b37097a]

(doc_4db5c05fb246d25364aceca0, doc_4c927585d591d31d7b37097a are the 2
failing indices)

In DEBUG you additionally get reports like:

2011-06-15 08:35:19.782 [DEBUG] gateway.local:71 - [Guido Carosella]
[doc_4db5c05fb246d25364aceca0][4]: not allocating,
number_of_allocated_shards_found [0], required_number [1]
2011-06-15 08:35:19.782 [DEBUG] gateway.local:71 - [Guido Carosella]
[doc_4c927585d591d31d7b37097a][0]: not allocating,
number_of_allocated_shards_found [0], required_number [1]

I can provide any other details that might be helpful.

Thanks as always for anyone/everyone's insight,

Alex

(I'm always hoping to see a question I can answer, so I can help out,
but someone - usually Shay - always beats me to it!)

Hi Alex

Setup:
My test/reference cluster ("infinite-aws") consists of 3 nodes (on
AWS, default gateway settings). It has 8 indices: 3 with 1 shard and
2 replicas, and 5 with either 5 shards (3 indices) or 10 shards (2
indices), each with 1 replica.

The problem is that, with 3 nodes, you can't be sure which shards exist
on which nodes, unless all of your indices are set to have 2 replicas
(i.e. 1 primary + 2 replicas = 3 copies of each shard, so all 3 nodes
would hold a copy).

So you have a few options available to you:

  1. Before copying:

    • Increase the number of replicas for each of your
      indices to 2 (see the curl sketch after this list).
    • Wait for cluster health to return to green.

    Then copy the data/ dir from any one of your nodes.

  2. Before copying:

    • Stop one of your nodes.
    • Wait for cluster health to report relocating_shards: 0.

    Then copy the data/ dir from any one of your nodes.

  3. Copy all of the data dirs from each of your nodes into your
    dev environment, as $ES_HOME/data/$CLUSTER_NAME/nodes/$NODE,
    where $NODE would be 0, 1 or 2.

    Start 3 nodes on the dev machine, wait for them to recover,
    then shut down one node, wait for relocating_shards: 0,
    then shut down the last node. Now you should have
    all of your data in $ES_HOME/data/$CLUSTER_NAME/nodes/0.
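
As a rough sketch for option 1 (the wait_for_status and timeout
parameters on cluster health should be available on 0.16.x, but
double-check against the docs):

curl -XPUT 'http://localhost:9200/_all/_settings' -d '
  { "index": { "number_of_replicas": 2 } }'

curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30m'

then tar up the data/ dir on whichever node is easiest to reach.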

clint

Thanks - I figured it was something like that... but:

Don't all nodes keep a file copy of all shards - isn't that what the
local gateway is doing for me?

E.g. from the local gateway documentation on the Elasticsearch site:
"The local gateway allows for recovery of the full cluster state and
indices from the local storage of each node, and does not require a
common node level shared storage"

Looking in /data/infinite-dev/nodes/0/indices//0/
index/, it appears to have all the data in it... presumably only the
metadata- and shard- state files are different across nodes?

So if I have a copy of every shard, is there really no way of getting
them to load? (I'm sure in 0.15 the metadata/shard state files used to
be human-editable but now they're distinctly binary :( )

Hiya

Don't all nodes keep a file copy of all shards - isn't that what the
local gateway is doing for me?

E.g. from the local gateway documentation on the Elasticsearch site:
"The local gateway allows for recovery of the full cluster state and
indices from the local storage of each node, and does not require a
common node level shared storage"

This is true as long as you start up the same number of nodes again :)

It doesn't store the data for every shard on every node, unless the
replicas setting requires that.

Looking in /data/infinite-dev/nodes/0/indices//0/
index/, it appears to have all the data in it... presumably only the
metadata- and shard- state files are different across nodes?

Hmm, I wonder if it is out of date. I think it may have an old copy of
shards that it once hosted. Not sure.

clint

the local gateway on a node only keeps a copy of shards that are allocated on that specific node, not of all the shards on all nodes.
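
A quick way to see which shards a node actually holds is to list the
shard directories under its data path (using the layout mentioned
earlier in the thread; the index name here is just one from your logs):

ls data/infinite-aws/nodes/0/indices/doc_4db5c05fb246d25364aceca0/
# one numbered directory per shard copy held by this node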

OK now I'm confused!

E.g. from the local gateway documentation on the Elasticsearch site:
"The local gateway allows for recovery of the full cluster state and
indices from the local storage of each node, and does not require a
common node level shared storage"

This is true as long as you start up the same number of nodes again :)

It doesn't store the data for every shard on every node, unless the
replicas setting requires that.

"full cluster state and indices from the local storage of each
node"

Which heavily implies you could lose the entire cluster except for 1
node and you'd be able to resurrect the entire cluster. If I have to
start extra nodes locally, that's fine, but I should be able to do so
from the data from just one node, no?

If that's not the case then I should be using an S3 gateway or
something like that in case multiple nodes die (currently I use a
local gateway and then create backups from one of the nodes)...

FWIW I just started up a second node with the existing data and it
didn't assign any new shards, it just split the working ones between
the 2. So empirically what you are saying is true, and that would
definitely kill off the above backup strategy, correct?
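
For anyone checking the same thing: the shard-to-node assignment can
also be read from the routing table in the cluster state API, e.g.

curl 'http://localhost:9200/_cluster/state?pretty=true'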

Thanks again for all the insight.

Alex

I don't think it implies it. The recoverability or resistance to failure when using the local gateway is determined by the number of replicas you have.

Ah OK, that's pretty definitive :)

Can I suggest the wording I quoted from the website documentation be
amended? I certainly misinterpreted it as meaning each node could be
used to restore the full cluster state.

So what is the recommended strategy for making periodic backups of the
entire cluster? Is it to use a centralized gateway (such as S3) and
then snapshot that periodically? Using local gateway and then making
snapshots of each node (I was only backing up one node) seems like a
very bad approach to me.

(I think you advocate the former strategy in an old blog post on the
Elasticsearch site, but that looks a bit out of date, e.g. it describes
EBS as a shared file system and doesn't discuss the S3 or local
gateways at all)

On Wednesday, June 15, 2011 at 5:29 PM, Alex at Ikanow wrote:

Ah OK, that's pretty definitive :)

Can I suggest the wording I quoted from the website documentation be
amended? I certainly misinterpreted it as meaning each node could be
used to restore the full cluster state.
Sure, updates to the docs are always welcome!

So what is the recommended strategy for making periodic backups of the
entire cluster? Is it to use a centralized gateway (such as S3) and
then snapshot that periodically? Using local gateway and then making
snapshots of each node (I was only backing up one node) seems like a
very bad approach to me.
You can use s3 and then back it up periodically. Note that s3 comes with an overhead (copying the changes to s3), and most systems I know running ES on AWS use the local gateway because of the bad IO AWS has.

If you want to do a backup, then you need to back up all the data on all the nodes when using the local gateway. That's the safest approach.
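
For reference, an s3 gateway setup in elasticsearch.yml looks roughly
like this (the bucket name and keys below are placeholders, and the
exact setting names should be checked against the AWS cloud plugin docs
for your version):

gateway:
  type: s3
  s3:
    bucket: your-backup-bucket

cloud:
  aws:
    access_key: YOUR_AWS_ACCESS_KEY
    secret_key: YOUR_AWS_SECRET_KEY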

Sure, updates to the docs are always welcome!

I proposed a few extra paragraphs covering this and a couple of the
other questions I had to bother you/the mailing list with. I hope
they're useful!

Thanks, will check it out!
