Choosing Shards and Replicas configuration values


(Paul Smith) #1

I may have missed it in the documentation, but I'm trying to work out in my
own mind what I'd choose for the number of shards and number of replicas if
we migrated over our home-grown Lucene service. Right now we don't actually
have any sharding, and our replicas are done via a simple 'pause, snapshot
the disk, and resume' indexing so that our secondary and tertiary sites can
have a regular sync point. Basic, but it works.

One can update the number of replicas via the API, but I'm not sure where to
find good rules of thumb on configuring the number of shards, which needs to
be baked into the index creation action from what I can tell.

Is it purely based on the likely number of available nodes in the cluster
(i.e. a 1:1 mapping between number of shards and number of nodes), or does
the community recommend something else? Do most people use the replica
number in a sort of RAID-like configuration, so that there's a tradeoff of
keeping more replicas up to date in order to survive more nodes going down?

To me, as a relative newbie to ES, it seems it would be good to have some
rough guides in the documentation or wiki.

In addition, I think some documentation about real-world experiences with
actual fail-over cases would be worth having in the wiki/docs: what happens
when real nodes catch fire and die, and what the expected behaviour is
(rough recovery times, how to build a replacement node and bring it back
into the cluster to take over from the dead one, etc.). From reading the
docs, it looks like ES is self-healing (subject to a proper number of
replicas), perhaps with some sluggishness while the cluster adjusts by
shifting workload onto nodes to take the burden of the deceased, but
actually hearing from people using ES and working through these real-world
cases would be worth collecting.

cheers,

Paul Smith


(ppearcy) #2

Hi,
We're trying to push to production soon. I am not quite an expert,
so don't take this as gospel :slight_smile:

Replicas are for availability and performance; shards are for scale.
You should start sharding if your index is close to being too large to
perform well on a single server. In general, you want to keep shards
under 10GB.

For replica count, we basically use two configs: high traffic vs. low
traffic. For high-traffic indexes we have a replica on each server (i.e.,
replicas = #nodes - 1), and for low-traffic indexes we have one replica for
availability.
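That two-config policy can be sketched in a few lines (my own illustration of the rule described above, not code from any real deployment; the function name is made up):

```python
# Sketch of the replica policy above: high-traffic indexes get a shard
# copy on every node (primary plus nodes-1 replicas); low-traffic
# indexes keep a single replica, purely for availability.

def replica_count(num_nodes, high_traffic):
    """Return a number_of_replicas value for an index."""
    if high_traffic:
        # One copy per node: the primary plus (nodes - 1) replicas.
        return max(num_nodes - 1, 0)
    # Low traffic: a single replica is enough for availability.
    return 1

print(replica_count(5, high_traffic=True))   # 4
print(replica_count(5, high_traffic=False))  # 1
```

Since the replica count can be changed on the fly via the API, this is the kind of policy you can adjust per index as traffic patterns change.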

In my testing, the biggest performance hit occurs when a shard gets
allocated and the filter caches need to be built up. Our performance
depends on the ES internal caches; if your queries don't, this
probably won't be as bad, but then again there is probably some
overhead building the system disk cache, too. This only occurs when a
new node is added to the cluster. We mitigate this by throttling
shard recovery to one shard at a time and clearing the work directory on a
node on start-up, to force a slow recovery. We also have a large
index count, which helps slow the recovery further. A little counter-
intuitive, but it results in the best performance for us under heavy load
during recovery scenarios.

If a node is misbehaving, it is best to get it out of the cluster
ASAP. One slow node can disproportionately slow down the whole
cluster.

Also, no manual rebuild or similar is needed. Start a new node with your
cluster settings and shards auto-allocate.

Regards,
Paul



(Paul Smith) #3

We are mitigating this by throttling shard recovery to one at a time and
clearing the work directory on a node on start up, to ensure a slow
recovery.

ooh, it's little tidbits like this that are really useful. How do you do
the shard recovery one at a time, though? I presume that by clearing out
the work directory, you're for all intents and purposes telling ES that
this is a brand-new, fresh node, rather than trying to reconcile the
on-disk shard info with the replicas being taken.

That is, to rephrase: instead of trying to recover a dead node, you're
basically 'putting a bullet in it' and treating it like a fresh node?

Paul


(Shay Banon) #4

Hi,

These are all good questions. Let me answer the simple ones first:

  1. Controlling the number of recoveries is done by setting
    "cluster.routing.allocation.concurrent_recoveries". It defaults to the
    number of processors + 1 (though it's usually IO intensive, we had to come
    up with something...). This controls the number of concurrent recoveries
    allowed on a specific node. Note, this is a 0.11 setting (to be released in
    a day or so).

  2. I am not sure about clearing the work dir. It will increase
    the time it takes to recover (as local data will not be reused) and of
    course should not be done when using the upcoming local gateway in 0.11. I
    would say you usually don't want to do it.
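To make the throttling concrete, here is a small sketch of the default described above and of an explicit one-at-a-time override (the setting name comes from this 0.11-era discussion; the dict is purely illustrative, not a real API call, and in practice the setting would live in the node configuration):

```python
import os

# Per the description above (0.11-era behaviour): the number of
# concurrent recoveries allowed on a node defaults to CPU count + 1.
def default_concurrent_recoveries():
    return (os.cpu_count() or 1) + 1

# Illustrative settings fragment for throttling recovery to one shard
# at a time, as the earlier poster does under heavy load. In a real
# deployment this would go in the node configuration, not a dict.
throttled_settings = {
    "cluster.routing.allocation.concurrent_recoveries": 1,
}

print(default_concurrent_recoveries())
```

The trade-off is explicit in the thread: a lower value makes recovery slower but keeps query latency steadier while caches warm up on the recovering node.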

The number of shards and number of replicas is a tough one to answer, since
it really depends on the system. Basically, more shards means faster
indexing (assuming you have enough machines to utilize the fact that there
are more shards) and a greater upper limit on the index size. If you have a
10-shard index, it can only grow up to 10 machines x disk size.

Replicas are there to improve both high availability and search performance.
It's easy not to worry about them too much, as there is an API to increase
or decrease them on the fly.

Back to the number-of-shards setting. I would like to explain why one can't
change it. The cost of doing some sort of "repartitioning" of the data in an
index is very high. If recovery alone can be taxing (though it should be
fast, compressed, and in 0.11 even faster), imagine how taxing it would be
to reindex part of the data to create a new partitioning scheme (to increase
the number of shards), as well as the potentially long pause during which
the index is unusable.

For this reason, I went with a fixed number of shards, but tried to solve it
in a different manner: you can always create more indices. And to complete
the picture, you can always search on more than one index.

So, how to set the number of shards? I would say first run a test that
loads, let's say, 10% of the expected data size. You can get a feeling for
the index size required (there is an API for that: the indices status API).
From that you can extrapolate both what you need now and what you will need
when you grow N times.
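The extrapolation is simple arithmetic; here is a sketch of it (my own illustration, combining the sizing test above with the ~10GB-per-shard rule of thumb mentioned earlier in this thread; the numbers in the example are made up):

```python
import math

# Load ~10% of the expected data, read the resulting index size from
# the indices status API, then project to full size and pick a shard
# count that keeps each shard at or under a target size.

def extrapolate_shards(sample_size_gb, sample_fraction, target_shard_gb=10.0):
    """Estimate a shard count from a partial data load."""
    projected_gb = sample_size_gb / sample_fraction
    # Round up so no shard exceeds the target size.
    return max(1, math.ceil(projected_gb / target_shard_gb))

# e.g. a 6 GB index from a 10% load projects to ~60 GB -> 6 shards
print(extrapolate_shards(6.0, 0.10))  # 6
```

Since the shard count is fixed at index creation, it is worth running this against the growth you expect, not just the data you have today.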

It gets a bit more interesting for really unbounded systems, like indexing
log files. For those, you can create an index per week of logs. At the start
of each week, you create a new index and index data into it. This gives you
a full scale-out model, and you can always search a month by simply
searching the last 4 weeks (for example).

This also gives you flexibility: for example, the last week can be really
hot when it comes to searching, so it can have more replicas. Older weeks
can have fewer replicas, as they will be searched less. Another nice aspect
of this design is that deleting old "weeks" is a snap: it's just a matter of
deleting that index, rather than deleting all the log entries associated
with that week within a single index.
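A minimal sketch of the index-per-week scheme (my own illustration; the `logs-YYYY-wNN` naming is invented for the example, not something from the thread):

```python
import datetime

# Each ISO week of logs gets its own index; a "last month" search just
# fans out over the four most recent weekly indices.

def weekly_index(day):
    """Name of the weekly index a given date's logs belong to."""
    year, week, _ = day.isocalendar()
    return f"logs-{year}-w{week:02d}"

def last_n_weeks(today, n=4):
    """Indices to query for the trailing n weeks, newest first."""
    return [weekly_index(today - datetime.timedelta(weeks=i)) for i in range(n)]

print(last_n_weeks(datetime.date(2010, 9, 28)))
```

Dropping a week of logs then becomes a single delete-index call on the oldest name, rather than a delete-by-query over one giant index.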

-shay.banon



(ppearcy) #5

Yeah, clearing the work directory is only important for us while under
high load. If we are not under high load, we don't need to jump
through any hoops during node bring up.

Thanks



(huyle) #6

Number of shards and number of replicas is a tough one to answer, since it really depends on the system. Basically, more shards means faster indexing (assuming you have enough machines to utilize the fact that there are more shards), and a greater upper limit for the index size. Basically, if you have a 10 shards index, it can only grow up to 10 machines x disk size for the index size

If we can have 10 machines max, does it make any sense to create 1000 shards? Is there any recommendation regarding the ratio between the maximum number of machines and the number of shards?

I noticed that initializing the index structure gets slower and slower as we increase the number of shards. It took 17 hours to initialize two indices with 10000 shards each. One index has 3 string fields; the other has 25 string fields, 3 double fields, 5 long fields, 6 date fields, and 4 boolean fields. So it's impractical to create an index with 10000 shards.

Thanks!

Huy


(huyle) #7

I should also add that initialization performance is not linear either. With the same indices mentioned above, it took 11 minutes to initialize them when the number of shards is configured at 200, and 1 minute when the number of shards is 100.

This is with ES 0.13.0.

Huy


(Shay Banon) #8

If you plan to have 10 machines, having 1000 shards certainly does not make sense, since, assuming you have 1 replica for each shard, it would only max out at 2000 machines...

For a 10-machine cluster with the flexibility to grow to a 20-machine cluster, and assuming we are talking about a single index, I would say create that index with 10-15 shards, with 1 or 2 replicas.
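The arithmetic behind both numbers above can be sketched directly (my own framing of the point, not code from the thread): the total shard copies in an index bound how many machines can each hold a single copy.

```python
# Total shard copies = primaries * (1 + replicas). That product is also
# the largest cluster in which every machine holds exactly one copy.

def total_shard_copies(primaries, replicas):
    return primaries * (1 + replicas)

# 1000 primaries with 1 replica spread across at most 2000 machines,
# as noted above; 10-15 primaries with 1-2 replicas suits ~10-45 nodes.
print(total_shard_copies(1000, 1))  # 2000
print(total_shard_copies(15, 2))    # 45
```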


(huyle) #9

Thanks Shay! I configured the cluster with 2 replicas and currently have 2 indices.

Huy


(Shay Banon) #10

And how many shards have you configured per index? (the index.number_of_shards setting)


(huyle) #11

The 10 machines were mentioned as a scenario for the purpose of discussion. The number of machines in our production environment will be based on the outcome of our evaluation process. We currently use Solr, but we are planning to migrate to ES.

With that said, our test environment currently has 6 machines, configured with 100 shards per index and, as mentioned, 2 replicas and 2 indices. In production, we would like to start out with a small number of machines, probably 6 to 10, but we plan to scale up to 100 or more machines.

Huy


