Replication basics


(Nick Hoffman) #1

Hey guys. I'm running an instance of ES on 2 different hosts. If I want to
have their data synchronized automatically, do I just do the following?

  1. In each host's ES config file,
    disable discovery.zen.ping.multicast.enabled .

  2. In each host's ES config file, set discovery.zen.ping.unicast.hosts to
    the other host's hostname/IP address.

  3. Restart ES on each host.

  4. Configure each index to have 1 shard and 1 replica.

Thanks!
Nick


(Shay Banon) #2

You are doing two things here:

  1. Use unicast instead of multicast discovery. Thats fine, you disable
    multicast in this case correctly, and set the hosts in the unicast list. I
    suggest simply setting a list of all hosts in the unicast hosts list
    (including the current one, thats fine) so the config file will be the same.

  2. If you want to have an a 1 replica for each shard, yea, have
    index.number_of_replicas of 1 (the default). As to the number of shards, up
    to you, 1 can be a good number, but it means your index size won't be able
    to scale over one machine. You can have more shards for the index, and if
    you, in the future, add more machines, it will scale to use those machines.
    See more here:
    http://www.elasticsearch.org/videos/2010/02/08/es-distributed-diagram.html.

-shay.banon

On Wed, Jan 11, 2012 at 7:03 AM, Nick Hoffman nick@deadorange.com wrote:

Hey guys. I'm running an instance of ES on 2 different hosts. If I want to
have their data synchronized automatically, do I just do the following?

  1. In each host's ES config file,
    disable discovery.zen.ping.multicast.enabled .

  2. In each host's ES config file, set discovery.zen.ping.unicast.hosts to
    the other host's hostname/IP address.

  3. Restart ES on each host.

  4. Configure each index to have 1 shard and 1 replica.

Thanks!
Nick


(Nick Hoffman) #3

On Wednesday, 11 January 2012 13:06:30 UTC-5, kimchy wrote:

You are doing two things here:

  1. Use unicast instead of multicast discovery. Thats fine, you disable
    multicast in this case correctly, and set the hosts in the unicast list. I
    suggest simply setting a list of all hosts in the unicast hosts list
    (including the current one, thats fine) so the config file will be the same.

That's a lot more convenient.

  1. If you want to have an a 1 replica for each shard, yea, have
    index.number_of_replicas of 1 (the default). As to the number of shards, up
    to you, 1 can be a good number, but it means your index size won't be able
    to scale over one machine. You can have more shards for the index, and if
    you, in the future, add more machines, it will scale to use those machines.
    See more here:
    http://www.elasticsearch.org/videos/2010/02/08/es-distributed-diagram.html
    .

I was thinking that I'd leave each index with 1 shard, and when I add more
nodes in the future, I'd increase the number of shards accordingly.

With only 2 nodes, is there any benefit to having more than 1 shard per
index?

Thanks for your help, Shay. Much appreciated.
Nick


(Ævar Arnfjörð Bjarmason) #4

You can't change the number of shards later on, only the number of
replicas. Making it say 8 instead of 1 gives you room to grow without
needing to rebuild the index.
On Jan 11, 2012 9:58 PM, "Nick Hoffman" nick@deadorange.com wrote:

On Wednesday, 11 January 2012 13:06:30 UTC-5, kimchy wrote:

You are doing two things here:

  1. Use unicast instead of multicast discovery. Thats fine, you disable
    multicast in this case correctly, and set the hosts in the unicast list. I
    suggest simply setting a list of all hosts in the unicast hosts list
    (including the current one, thats fine) so the config file will be the same.

That's a lot more convenient.

  1. If you want to have an a 1 replica for each shard, yea, have
    index.number_of_replicas of 1 (the default). As to the number of shards, up
    to you, 1 can be a good number, but it means your index size won't be able
    to scale over one machine. You can have more shards for the index, and if
    you, in the future, add more machines, it will scale to use those machines.
    See more here: http://www.elasticsearch.org/videos/2010/
    02/08/es-distributed-diagram.**htmlhttp://www.elasticsearch.org/videos/2010/02/08/es-distributed-diagram.html
    .

I was thinking that I'd leave each index with 1 shard, and when I add more
nodes in the future, I'd increase the number of shards accordingly.

With only 2 nodes, is there any benefit to having more than 1 shard per
index?

Thanks for your help, Shay. Much appreciated.
Nick


(Nick Hoffman) #5

On Wednesday, 11 January 2012 18:59:42 UTC-5, Ævar Arnfjörð Bjarmason wrote:

You can't change the number of shards later on, only the number of
replicas. Making it say 8 instead of 1 gives you room to grow without
needing to rebuild the index.

Ah, true. I'd forgotten that a reindex is required to change the number of
shards.

If you have only 2 nodes, and set each index to have 8 shards:

  1. Will each index be sharded across the 2 nodes?

  2. Will adding a 3rd node rebalance the indexes across all 3 nodes?

Cheers,
Nick


(Shay Banon) #6

On Thu, Jan 12, 2012 at 2:36 AM, Nick Hoffman nick@deadorange.com wrote:

On Wednesday, 11 January 2012 18:59:42 UTC-5, Ævar Arnfjörð Bjarmason
wrote:

You can't change the number of shards later on, only the number of
replicas. Making it say 8 instead of 1 gives you room to grow without
needing to rebuild the index.

Ah, true. I'd forgotten that a reindex is required to change the number of
shards.

If you have only 2 nodes, and set each index to have 8 shards:

  1. Will each index be sharded across the 2 nodes?

Yes, you will have 8 shards on one node, and 8 replicas on the other.

  1. Will adding a 3rd node rebalance the indexes across all 3 nodes?

Yes.

Cheers,
Nick


(Nick Hoffman) #7

On Thursday, 12 January 2012 05:24:07 UTC-5, kimchy wrote:

On Thu, Jan 12, 2012 at 2:36 AM, Nick Hoffman ni...@deadorange.comwrote:

If you have only 2 nodes, and set each index to have 8 shards:

  1. Will each index be sharded across the 2 nodes?

Yes, you will have 8 shards on one node, and 8 replicas on the other.

That's awesome. My only question is how do you determine how many shards to
use? 2, 5, 8, etc seem like arbitrary numbers.

Thanks again, Shay.


(Shay Banon) #8

Yea, deciding the number requires a bit of work. What I usually recommend
people is to create a single index with a single shard, on a single
instance and "shove" data to it. Find that upper limit based on your data
set and search requirements. You should get a number out of it. Something
like N docs, or X size. Then, you can extrapolate from there.

On Thu, Jan 12, 2012 at 6:14 PM, Nick Hoffman nick@deadorange.com wrote:

On Thursday, 12 January 2012 05:24:07 UTC-5, kimchy wrote:

On Thu, Jan 12, 2012 at 2:36 AM, Nick Hoffman ni...@deadorange.comwrote:

If you have only 2 nodes, and set each index to have 8 shards:

  1. Will each index be sharded across the 2 nodes?

Yes, you will have 8 shards on one node, and 8 replicas on the other.

That's awesome. My only question is how do you determine how many shards
to use? 2, 5, 8, etc seem like arbitrary numbers.

Thanks again, Shay.


(Nick Hoffman) #9

On Thursday, 12 January 2012 18:17:23 UTC-5, kimchy wrote:

Yea, deciding the number requires a bit of work. What I usually recommend
people is to create a single index with a single shard, on a single
instance and "shove" data to it. Find that upper limit based on your data
set and search requirements. You should get a number out of it. Something
like N docs, or X size. Then, you can extrapolate from there.

Interesting. Once you find the max number of docs that can be stored in a
single instance, for example, what exactly do you extrapolate from that
number?

I guess some applications will have a finite number of docs indexed, which
would enable you to calculate how many shards you'll need when you reach
that finite number.

However, if there's no upper limit to the number of documents that you'll
index, I'm not sure how knowing how many docs can be stored in an instance
will help.

Cheers,
Nick


(Berkay Mollamustafaoglu-2) #10

Nick,

I think you can use index aliases and create new indices as number of docs
increase
http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html

ElasticSearch allows assigning the same alias to multiple indices. Taking
advantage of this feature, once you reach the max number of docs or size
you've determined, create a new index, associate the alias with new index.
In a way, having an alias with multiple indices is similar to having an
index with multiple shards.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Fri, Jan 13, 2012 at 12:11 AM, Nick Hoffman nick@deadorange.com wrote:

On Thursday, 12 January 2012 18:17:23 UTC-5, kimchy wrote:

Yea, deciding the number requires a bit of work. What I usually recommend
people is to create a single index with a single shard, on a single
instance and "shove" data to it. Find that upper limit based on your data
set and search requirements. You should get a number out of it. Something
like N docs, or X size. Then, you can extrapolate from there.

Interesting. Once you find the max number of docs that can be stored in a
single instance, for example, what exactly do you extrapolate from that
number?

I guess some applications will have a finite number of docs indexed, which
would enable you to calculate how many shards you'll need when you reach
that finite number.

However, if there's no upper limit to the number of documents that you'll
index, I'm not sure how knowing how many docs can be stored in an instance
will help.

Cheers,
Nick


(Shay Banon) #11

First of all, it will help with your sizing both your data and the number
of nodes you will need. As to systems that don't have a limit to the number
of docs, it really depends on how the data flows. In theory, my blog has no
limit on the number posts it will have, but I am not going to post 1
billion posts on it.

First, the most common case. You run your tests, and see that a single
shard can hold 100 million docs. In your wildest expectation, you are not
going to cross 1 billion. Thats simple, have an index with 10 shards,
enough hardware, and be done with it. Thats where most system fall, yet
many "think" they need to support unlimited data, and over-engineer.

For continuos stream of data, like log files, you can use the ability to
create indices on the fly, so you can have an index per week / month and so
on. A system that index the tweets that come along can use that as well.
You can easily search over several indices.

Another type is data that can be highly partitioned. User based data for
example. You can use the username to do the routing (both when indexing and
searching), and create a considerable number of shards. The number of
shards does not affect search - on limited HW - since you don't search on
all of them, you route based on user name. For example, you can create a
200 shard system on 4 nodes and that will work fine.

-shay.banon

On Fri, Jan 13, 2012 at 7:11 AM, Nick Hoffman nick@deadorange.com wrote:

On Thursday, 12 January 2012 18:17:23 UTC-5, kimchy wrote:

Yea, deciding the number requires a bit of work. What I usually recommend
people is to create a single index with a single shard, on a single
instance and "shove" data to it. Find that upper limit based on your data
set and search requirements. You should get a number out of it. Something
like N docs, or X size. Then, you can extrapolate from there.

Interesting. Once you find the max number of docs that can be stored in a
single instance, for example, what exactly do you extrapolate from that
number?

I guess some applications will have a finite number of docs indexed, which
would enable you to calculate how many shards you'll need when you reach
that finite number.

However, if there's no upper limit to the number of documents that you'll
index, I'm not sure how knowing how many docs can be stored in an instance
will help.

Cheers,
Nick


(Nick Hoffman) #12

Thanks for taking the time to explain that, Shay. It makes a lot of sense.

Cheers,
Nick


(Treff7es) #13

Just to feed my curiosity:
Is there any drawback of creating new index?
Is there any limitation how many indices can be created?

Thanks,
Tamas

On Jan 13, 9:12 am, Shay Banon kim...@gmail.com wrote:

First of all, it will help with your sizing both your data and the number
of nodes you will need. As to systems that don't have a limit to the number
of docs, it really depends on how the data flows. In theory, my blog has no
limit on the number posts it will have, but I am not going to post 1
billion posts on it.

First, the most common case. You run your tests, and see that a single
shard can hold 100 million docs. In your wildest expectation, you are not
going to cross 1 billion. Thats simple, have an index with 10 shards,
enough hardware, and be done with it. Thats where most system fall, yet
many "think" they need to support unlimited data, and over-engineer.

For continuos stream of data, like log files, you can use the ability to
create indices on the fly, so you can have an index per week / month and so
on. A system that index the tweets that come along can use that as well.
You can easily search over several indices.

Another type is data that can be highly partitioned. User based data for
example. You can use the username to do the routing (both when indexing and
searching), and create a considerable number of shards. The number of
shards does not affect search - on limited HW - since you don't search on
all of them, you route based on user name. For example, you can create a
200 shard system on 4 nodes and that will work fine.

-shay.banon

On Fri, Jan 13, 2012 at 7:11 AM, Nick Hoffman n...@deadorange.com wrote:

On Thursday, 12 January 2012 18:17:23 UTC-5, kimchy wrote:

Yea, deciding the number requires a bit of work. What I usually recommend
people is to create a single index with a single shard, on a single
instance and "shove" data to it. Find that upper limit based on your data
set and search requirements. You should get a number out of it. Something
like N docs, or X size. Then, you can extrapolate from there.

Interesting. Once you find the max number of docs that can be stored in a
single instance, for example, what exactly do you extrapolate from that
number?

I guess some applications will have a finite number of docs indexed, which
would enable you to calculate how many shards you'll need when you reach
that finite number.

However, if there's no upper limit to the number of documents that you'll
index, I'm not sure how knowing how many docs can be stored in an instance
will help.

Cheers,
Nick- Hide quoted text -

  • Show quoted text -

(Shay Banon) #14

On Mon, Jan 16, 2012 at 12:14 PM, Treff7es treff7es@gmail.com wrote:

Just to feed my curiosity:
Is there any drawback of creating new index?
Is there any limitation how many indices can be created?

There is a limit, mainly coming from resources associated with an index (or
more specifically a shard). Each shard is a Lucene index, that comes with
its overhead (memory usage, file descriptors). So, creating an 1000 indices
(with a single shard lets say) on a single node will probably tax the
system too much. Usually, you solve this with things like routing and using
a single index with a considerable more shards than you might need (since
when you search, you will hit a single shard).

Thanks,
Tamas

On Jan 13, 9:12 am, Shay Banon kim...@gmail.com wrote:

First of all, it will help with your sizing both your data and the number
of nodes you will need. As to systems that don't have a limit to the
number
of docs, it really depends on how the data flows. In theory, my blog has
no
limit on the number posts it will have, but I am not going to post 1
billion posts on it.

First, the most common case. You run your tests, and see that a single
shard can hold 100 million docs. In your wildest expectation, you are not
going to cross 1 billion. Thats simple, have an index with 10 shards,
enough hardware, and be done with it. Thats where most system fall, yet
many "think" they need to support unlimited data, and over-engineer.

For continuos stream of data, like log files, you can use the ability to
create indices on the fly, so you can have an index per week / month and
so
on. A system that index the tweets that come along can use that as well.
You can easily search over several indices.

Another type is data that can be highly partitioned. User based data for
example. You can use the username to do the routing (both when indexing
and
searching), and create a considerable number of shards. The number of
shards does not affect search - on limited HW - since you don't search on
all of them, you route based on user name. For example, you can create a
200 shard system on 4 nodes and that will work fine.

-shay.banon

On Fri, Jan 13, 2012 at 7:11 AM, Nick Hoffman n...@deadorange.com
wrote:

On Thursday, 12 January 2012 18:17:23 UTC-5, kimchy wrote:

Yea, deciding the number requires a bit of work. What I usually
recommend

people is to create a single index with a single shard, on a single
instance and "shove" data to it. Find that upper limit based on your
data

set and search requirements. You should get a number out of it.
Something

like N docs, or X size. Then, you can extrapolate from there.

Interesting. Once you find the max number of docs that can be stored
in a

single instance, for example, what exactly do you extrapolate from that
number?

I guess some applications will have a finite number of docs indexed,
which

would enable you to calculate how many shards you'll need when you
reach

that finite number.

However, if there's no upper limit to the number of documents that
you'll

index, I'm not sure how knowing how many docs can be stored in an
instance

will help.

Cheers,
Nick- Hide quoted text -

  • Show quoted text -

(system) #15