Yes, overallocation of shards is to accommodate cluster growth in certain
"data flows". For example, by default, when you create an index with
elasticsearch, it has 5 shards by default, even if you use just one node.
This allows the index to grow up to 5 nodes (assuming no replicas and no
other indices, so you have 1 shard per node) in terms of size (read perf
can be increased by increasing replicas (and nodes)).
Going back to "data flow". This is the one of the most important concept to
understand about your system when you use elasticsearch. Time based data
(tweets, logs) can be indexed using rolling time base indices, and then,
the number of shards per index are simply there so accomodate the expected
size for that time frame the index is responsible for. Aliases can be used
then to point to "latest" index to index to, or time base aliases can be
created that point to several indices to create "bigger" time frames
without needing to specify all the indices that compose that time frame.
A "user" based data flow, in theory, is perfect for an index per user case.
If you have enough nodes (each shard is a Lucene index, which has a cost)
in the cluster to do that, thats great, and several very large scale ES
users actually do that. But, if you have a small cluster (or constraint
budget / HW wise), you could say, ok, I will do all on a single index, but
then, that index will need quite a few shards to support potential growth.
This can be problematic without doing anything else, because if you create
an index with 30 shards on a single node cluster, each search request (for
that user data) will need to span all 30 shards.
This is where routing comes to play. You can have all the data for a user
allocated to a specific shard using routing (with clever use of routing,
you can actually have user data span several shards, for example, using
time base routing included in the user routing value).
This means that when you search for that user data, and provide the routing
value, it will only go to a single shard. You can create an index with 100
shards, and you won't incur the overhead of searching across all shards for
specific user data. Obviously, when you do that, you also need to filter
each query to only have that specific user data. Remember also, that you
can still search across several users data by simply using several routing
values.
For this usecase, aliases really shine. You can have a single index, lets
call it users, with 50 shards. And define an alias for each user. Lets say
the first user is user1, you define an alias called user1. That alias can
have a routing value of "user1", which means when you index and search
against that alais (as if it was an "index', i.e. /user1/_search), the
routing value will be automatically applied. But, you still need to filter
to only see that user data. For that, an alias can also be associated with
a filter (i.e. a term filter on the user name with the user value). Then,
every time you search against /user1/_search, the filter will be
automatically applied.
One nice aspect of it is that you can also search across several aliases.
For example, /user1,user2/_search (as you can search over multiple
indices), and it will "do the right thing". It will do a search with two
routing values, and an "OR'ed" filter.
-shay.banon
On Fri, Jan 20, 2012 at 12:40 AM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:
Hi,
Shay mentioned/suggested over-allocation of shards on a few occasions
and I wanted to make sure I understand why.
Is this primarily to accommodate future cluster growth, so that new
nodes can be added and existing shards re-distributed / re-balanced
over all old+new nodes?
Or is there some other or additional reason for this?
Thanks,
Otis
Sematext is Hiring World-Wide -- Jobs - Sematext