Over-allocation of shards

Hi,

Shay mentioned/suggested over-allocation of shards on a few occasions
and I wanted to make sure I understand why.

Is this primarily to accommodate future cluster growth, so that new
nodes can be added and existing shards re-distributed / re-balanced
over all old+new nodes?

Or is there some other or additional reason for this?

Thanks,
Otis

Sematext is Hiring World-Wide -- http://sematext.com/about/jobs.html

Yes, overallocation of shards is to accommodate cluster growth in certain
"data flows". For example, by default, when you create an index with
elasticsearch, it has 5 shards by default, even if you use just one node.
This allows the index to grow up to 5 nodes (assuming no replicas and no
other indices, so you have 1 shard per node) in terms of size (read perf
can be increased by increasing replicas (and nodes)).

Going back to "data flow". This is the one of the most important concept to
understand about your system when you use elasticsearch. Time based data
(tweets, logs) can be indexed using rolling time base indices, and then,
the number of shards per index are simply there so accomodate the expected
size for that time frame the index is responsible for. Aliases can be used
then to point to "latest" index to index to, or time base aliases can be
created that point to several indices to create "bigger" time frames
without needing to specify all the indices that compose that time frame.

A "user" based data flow, in theory, is perfect for an index per user case.
If you have enough nodes (each shard is a Lucene index, which has a cost)
in the cluster to do that, thats great, and several very large scale ES
users actually do that. But, if you have a small cluster (or constraint
budget / HW wise), you could say, ok, I will do all on a single index, but
then, that index will need quite a few shards to support potential growth.
This can be problematic without doing anything else, because if you create
an index with 30 shards on a single node cluster, each search request (for
that user data) will need to span all 30 shards.

This is where routing comes to play. You can have all the data for a user
allocated to a specific shard using routing (with clever use of routing,
you can actually have user data span several shards, for example, using
time base routing included in the user routing value).

This means that when you search for that user data, and provide the routing
value, it will only go to a single shard. You can create an index with 100
shards, and you won't incur the overhead of searching across all shards for
specific user data. Obviously, when you do that, you also need to filter
each query to only have that specific user data. Remember also, that you
can still search across several users data by simply using several routing
values.

For this usecase, aliases really shine. You can have a single index, lets
call it users, with 50 shards. And define an alias for each user. Lets say
the first user is user1, you define an alias called user1. That alias can
have a routing value of "user1", which means when you index and search
against that alais (as if it was an "index', i.e. /user1/_search), the
routing value will be automatically applied. But, you still need to filter
to only see that user data. For that, an alias can also be associated with
a filter (i.e. a term filter on the user name with the user value). Then,
every time you search against /user1/_search, the filter will be
automatically applied.

One nice aspect of it is that you can also search across several aliases.
For example, /user1,user2/_search (as you can search over multiple
indices), and it will "do the right thing". It will do a search with two
routing values, and an "OR'ed" filter.

-shay.banon

On Fri, Jan 20, 2012 at 12:40 AM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

Hi,

Shay mentioned/suggested over-allocation of shards on a few occasions
and I wanted to make sure I understand why.

Is this primarily to accommodate future cluster growth, so that new
nodes can be added and existing shards re-distributed / re-balanced
over all old+new nodes?

Or is there some other or additional reason for this?

Thanks,
Otis

Sematext is Hiring World-Wide -- Jobs - Sematext

On Jan 19, 3:02 pm, Shay Banon kim...@gmail.com wrote:

Yes, overallocation of shards is to accommodate cluster growth in certain
"data flows". For example, by default, when you create an index with
elasticsearch, it has 5 shards by default, even if you use just one node.
This allows the index to grow up to 5 nodes (assuming no replicas and no
other indices, so you have 1 shard per node) in terms of size (read perf
can be increased by increasing replicas (and nodes)).

I've wondered about the default setting of 5 shards as well. If I know
how much data I'm indexing (e.g. indexing an archive of tweets), I'll
want to change that setting. If I don't (e.g. indexing tweets in
"realtime"), I want to start with 1 shard (good enough for most
users), and re-shard to more than that only when the number of
documents passes a threshold. This requires rebuilding the index;
would be nice if elasticsearch had a simple command to do that.

Thank you Shay for a thorough answer - we've been making use of
exactly these index alias/sharding/routing functionality in a number
of projects where we used Elasticsearch. As a matter of fact, this is
from 4 hours ago: http://twitter.com/#!/otisg/status/160128277497913345

And thanks for confirming the over-allocation bit!

Otis

Sematext is Hiring World-Wide -- Jobs - Sematext

On Jan 19, 6:02 pm, Shay Banon kim...@gmail.com wrote:

Yes, overallocation of shards is to accommodate cluster growth in certain
"data flows". For example, by default, when you create an index with
elasticsearch, it has 5 shards by default, even if you use just one node.
This allows the index to grow up to 5 nodes (assuming no replicas and no
other indices, so you have 1 shard per node) in terms of size (read perf
can be increased by increasing replicas (and nodes)).

Going back to "data flow". This is the one of the most important concept to
understand about your system when you use elasticsearch. Time based data
(tweets, logs) can be indexed using rolling time base indices, and then,
the number of shards per index are simply there so accomodate the expected
size for that time frame the index is responsible for. Aliases can be used
then to point to "latest" index to index to, or time base aliases can be
created that point to several indices to create "bigger" time frames
without needing to specify all the indices that compose that time frame.

A "user" based data flow, in theory, is perfect for an index per user case.
If you have enough nodes (each shard is a Lucene index, which has a cost)
in the cluster to do that, thats great, and several very large scale ES
users actually do that. But, if you have a small cluster (or constraint
budget / HW wise), you could say, ok, I will do all on a single index, but
then, that index will need quite a few shards to support potential growth.
This can be problematic without doing anything else, because if you create
an index with 30 shards on a single node cluster, each search request (for
that user data) will need to span all 30 shards.

This is where routing comes to play. You can have all the data for a user
allocated to a specific shard using routing (with clever use of routing,
you can actually have user data span several shards, for example, using
time base routing included in the user routing value).

This means that when you search for that user data, and provide the routing
value, it will only go to a single shard. You can create an index with 100
shards, and you won't incur the overhead of searching across all shards for
specific user data. Obviously, when you do that, you also need to filter
each query to only have that specific user data. Remember also, that you
can still search across several users data by simply using several routing
values.

For this usecase, aliases really shine. You can have a single index, lets
call it users, with 50 shards. And define an alias for each user. Lets say
the first user is user1, you define an alias called user1. That alias can
have a routing value of "user1", which means when you index and search
against that alais (as if it was an "index', i.e. /user1/_search), the
routing value will be automatically applied. But, you still need to filter
to only see that user data. For that, an alias can also be associated with
a filter (i.e. a term filter on the user name with the user value). Then,
every time you search against /user1/_search, the filter will be
automatically applied.

One nice aspect of it is that you can also search across several aliases.
For example, /user1,user2/_search (as you can search over multiple
indices), and it will "do the right thing". It will do a search with two
routing values, and an "OR'ed" filter.

-shay.banon

On Fri, Jan 20, 2012 at 12:40 AM, Otis Gospodnetic <

otis.gospodne...@gmail.com> wrote:

Hi,

Shay mentioned/suggested over-allocation of shards on a few occasions
and I wanted to make sure I understand why.

Is this primarily to accommodate future cluster growth, so that new
nodes can be added and existing shards re-distributed / re-balanced
over all old+new nodes?

Or is there some other or additional reason for this?

Thanks,
Otis

Sematext is Hiring World-Wide --Jobs - Sematext

Hi Eric

I've wondered about the default setting of 5 shards as well. If I know
how much data I'm indexing (e.g. indexing an archive of tweets), I'll
want to change that setting.

You can specify that when creating the index

If I don't (e.g. indexing tweets in
"realtime"), I want to start with 1 shard (good enough for most
users), and re-shard to more than that only when the number of
documents passes a threshold. This requires rebuilding the index;
would be nice if elasticsearch had a simple command to do that.

you can use several indices and create a new index if documents
passes a threshold (+ use the index alias feature) ...
No need to recreate shards ...

Peter.

On Fri, Jan 20, 2012 at 03:00, Karussell tableyourtime@googlemail.com wrote:

I've wondered about the default setting of 5 shards as well. If I know
how much data I'm indexing (e.g. indexing an archive of tweets), I'll
want to change that setting.

You can specify that when creating the index

That's what I do; just pointing out that I never had a case where I
wanted to start an index with precisely 5 shards.

I'm assuming that queries on an index with 5 shards running on a
single machine are going to be slower and use more resources than
queries on an index with a single shard, right?

you can use several indices and create a new index if documents
passes a threshold (+ use the index alias feature) ...
No need to recreate shards ...

Hadn't considered that approach! I imagine that internally, shards are
similar to aliased indexes? So if there was a simple way to replace
"partition by hash" with "partition by row count" (and have shards
created lazily)...

on a single machine are going to be slower

yes, of course.

Hadn't considered that approach! I imagine that internally, shards are
similar to aliased indexes? So if there was a simple way to replace
"partition by hash" with "partition by row count" (and have shards
created lazily)...

there is a similar feature (not really what you are after) .. via
routing. but creating 'shards' lazily can only be done via creating
'indices' which are under your control.

Peter.

Hi,

When you use this method : 1 index + routing.
It is not possible to apply an analyser by alias as when we work with an
index, isnt it ?

I have an other question:
I have listened the Shay Banon's presentation
http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html

and I have a question:
I dont understand exactly the difference to build 1 shard per index and
this method to build 1 index + routing.
I understood that it's better for the system ressource but not exactly why ?

Thanks in advance,
Christophe.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ok, i understood my second question.
More of one user can use a same shard.

And are you ok with my first question ?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.