Replicating all data to a single node


(snowmonkey) #1

We have a cluster with 3 nodes (with 1 replica and 5 shards) and we'd
like to create a 4th node which will have a complete copy of all the
data. Is this possible? And if so how do we best achieve it? We only
want the complete copy on the new 4th node (not all nodes).

So perhaps having 2 replicas, but where I can specify the location of
one of the replicas?

FYI - the purpose of this is to create a 'hot backup' of the data
which we can use, should a release go bad and we need to roll-back.


(Berkay Mollamustafaoglu-2) #2

Afaik it is not possible to control where a replica goes. You can control
which nodes an index should go to using shard allocation via tags, so you
may be able to create a config that works.
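
As a hedged sketch of the tag-based allocation mentioned above (the attribute name `tag` and the value `backup` are illustrative assumptions, not required names), each node would declare a tag in its elasticsearch.yml, and the index would be filtered onto matching nodes:

```yaml
# elasticsearch.yml on the dedicated node (illustrative tag value)
node.tag: backup

# index-level setting (applied via the index update-settings API),
# restricting the index's shards to nodes whose tag matches:
index.routing.allocation.include.tag: backup
```

Note that such filtering applies to all copies of an index, primaries and replicas alike, which is why it doesn't solve the replica-placement question on its own.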

More importantly, I don't think this would be a good way to have a hot
backup. Everything that happens to the primary would also happen to the
replicas, so if something goes wrong, it would go wrong with the replicas as well.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Mon, Jan 23, 2012 at 12:06 PM, snowmonkey npomfret@gmail.com wrote:

We have a cluster with 3 nodes (with 1 replica and 5 shards) and we'd
like to create a 4th node which will have a complete copy of all the
data. Is this possible? And if so how do we best achieve it? We only
want the complete copy on the new 4th node (not all nodes).

So perhaps having 2 replicas, but where I can specify the location of
one of the replicas?

FYI - the purpose of this is to create a 'hot backup' of the data
which we can use, should a release go bad and we need to roll-back.


(snowmonkey) #3

If we had a backup node with all the data on it, we'd exclude this node
when doing a release. Then if the release goes well, it would start
building a brand new backup node (and the old one would become
obsolete and eventually be deleted).

If the release does not go well (say the index data gets corrupted),
we'd discard the bad index nodes (and therefore all the corrupt data)
and use the backup node from the previous release to re-populate the
index via regular replication.

However, if, as you say, we can't specify where shards get allocated,
then we'll need to come up with a different solution. I thought it
might be possible because it must be possible to specify rules about
having a copy of the data in different data centers... or have I made
another incorrect assumption there?



(Berkay Mollamustafaoglu-2) #4

ES currently does not support multiple data center deployment scenarios
like you'd have in Cassandra, etc.
I understand the flow you're describing, but I don't think there is an
option at the moment to store the replicas on a particular node. It sounds
like an enhancement to the shard allocation feature, where one could tag
which servers should be used for primaries and which for replicas.




(Shay Banon) #5

You can specify where a replica shard will be allocated, or more
specifically, do something similar to a rack-aware deployment (even a
forced one); see more here:
http://www.elasticsearch.org/guide/reference/modules/cluster.html (under
allocation awareness).

But you are after a greater separation between a shard and a replica than
is provided. If you want to back up data, you can simply back up the data
location of each node. Hopefully, in the future, we will have a
backup/restore API in place.
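
As a hedged sketch of what such an awareness setup might look like (the attribute name `zone` and its values are illustrative assumptions, not fixed names), each node declares an attribute, and the cluster is told to spread copies across the attribute's values:

```yaml
# elasticsearch.yml on each node: declare which zone this node is in
# (the attribute name "zone" and the value "zone_a" are illustrative)
node.zone: zone_a

# cluster-level setting: make shard allocation aware of the attribute
cluster.routing.allocation.awareness.attributes: zone

# for *forced* awareness, also list the values copies must spread across,
# so one zone never takes on more than its share of copies even when
# another zone's nodes are absent
cluster.routing.allocation.awareness.force.zone.values: zone_a,zone_b
```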



(snowmonkey) #6

It's possible that forced awareness can be used to achieve what I'm
after. I'll increase the replication factor from 1 to 2 and use zones
to force a copy of the data on to a 'back-up' zone...

So my original cluster of N nodes is arbitrarily split into 2 zones, A
and B. A new node (the backup) is added in a zone of its own, C. And my
indexes are set to have 2 replicas. No matter where I write to (zone
A, B or C), it'll be forced to replicate to the other two zones, and as
the 'back-up' zone consists of only a single machine, it will be
guaranteed to have all of the data.
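
A minimal sketch of that plan, assuming the illustrative attribute name `zone` and zone values `a`, `b` and `backup` (none of these names are required by ES):

```yaml
# elasticsearch.yml on the single backup node; nodes in the original
# cluster would carry node.zone: a or node.zone: b instead
node.zone: backup

# cluster settings: forced awareness across all three zone values means
# each zone must hold a full copy of every shard, so the lone node in
# the "backup" zone ends up with a complete copy of the data
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: a,b,backup
```

The index would also need `index.number_of_replicas: 2`, so that there are three copies of each shard, one per zone.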



(derrickburns) #7

This answers a question that I had been contemplating! Thanks.


(Jeremy Jongsma) #8

Great, I've been looking for the "zones" solution for a while to solve the
issue of latency across physical locations in one cluster.

But it also seems like a disadvantage that a replica will never live
within the same allocation awareness attribute ("zone", in my case) as
another, even after all zones are filled. Right now it sounds like my
options are:

  1. use a single zone, and get better load-sharing due to more available
     replicas, but also introduce more latency when jumping across locations, or
  2. use multiple zones, but limit each zone to a single shard replica,
     pushing all query load for that shard onto one instance in each zone.

Is that accurate, or can multiple replicas exist in a zone if all other
zones already contain a replica? For example, what happens when I have
only two values for the zone awareness attribute, but
index.number_of_replicas = 3?


(Shay Banon) #9

Yes, a single zone can have more than one replica of a shard. It will be
evened out between the zones, so if you have 2 zones and an index with 3
replicas (totaling 4 copies), you will have 2 copies of each shard in
each zone.
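
The arithmetic can be sketched with a hypothetical helper (this is not ES code, just an illustration of how the total copies, primary plus replicas, even out across zones):

```python
def copies_per_zone(number_of_replicas, num_zones):
    """Illustrative only: spread primary + replicas evenly across zones."""
    total_copies = number_of_replicas + 1  # one primary plus its replicas
    base, extra = divmod(total_copies, num_zones)
    # the first `extra` zones hold one more copy than the rest
    return [base + 1 if i < extra else base for i in range(num_zones)]

# Jeremy's case: two zone values, index.number_of_replicas = 3
print(copies_per_zone(3, 2))  # -> [2, 2]
```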


