We have a scenario where we use routing in a production system at my
company (the main ecommerce site for LATAM).
The site manages millions of users selling millions of items, and each
user needs to manage their own items in a 'My Account' site. For that
system, we created a single ES index for all items and search it using the
seller ID as a filter.
At indexing time we route documents using the seller ID, so that items from
the same seller go to the same shard(s), and thus we balance the required
space among servers (we can treat the distribution of items per user as
normal). To improve search times, as Shay said, we route user searches
using the same seller ID so that they hit only the right servers and not all
of them.
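
To make the above concrete, here is a rough sketch of what the two requests could look like. The index name, type name and "title" field are made up for illustration, and the exact URL shape and query body depend on the ES version (the bool/filter form below is from later releases; releases from around the time of this thread used a "filtered" query instead). The one thing taken from the description above is that the same seller ID is passed as the `routing` parameter both when indexing and when searching.

```python
import json
from urllib.parse import urlencode

INDEX = "items"    # hypothetical index name
DOC_TYPE = "item"  # hypothetical type name (newer ES versions use "_doc")


def index_request(item_id, seller_id, doc):
    """Build (method, path, body) for indexing one item, routed by seller ID."""
    path = f"/{INDEX}/{DOC_TYPE}/{item_id}?" + urlencode({"routing": seller_id})
    body = dict(doc, seller_id=seller_id)  # keep the seller ID in the document
    return "PUT", path, json.dumps(body, sort_keys=True)


def search_request(seller_id, text):
    """Build (method, path, body) for a search routed to the seller's shard."""
    path = f"/{INDEX}/_search?" + urlencode({"routing": seller_id})
    body = {
        "query": {
            "bool": {
                "must": {"match": {"title": text}},  # "title" is a made-up field
                # Routing only picks the shard; other sellers share that shard,
                # so the seller filter is still required.
                "filter": {"term": {"seller_id": seller_id}},
            }
        }
    }
    return "POST", path, json.dumps(body, sort_keys=True)


print(index_request("42", "seller-7", {"title": "mountain bike"})[1])
print(search_request("seller-7", "bike")[1])
```

Note that the seller filter stays in the query even with routing: routing narrows which shard is searched, not which documents match.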
So, although some sellers of course have far more items than the rest, which
may slightly overload some servers, the number of users is big enough
for the load to be evenly distributed with this scheme. On the other
hand, a possible alternative to this approach would be to have one index per
user, but we weren't sure how ES would behave with millions of indices, and
we preferred to keep all items under the same "umbrella", as it simplifies
administration tasks when we need to search for items across all users.
Hope it helps! Cheers,
Frederic
On Thursday, 1 March 2012 06:52:22 UTC-3, Clinton Gormley wrote:
Hiya Paul
On Thu, 2012-03-01 at 16:51 +1100, Paul Smith wrote:
I'm intrigued by the routing property that can be used during
indexing/searching.
Me too. I'm in the process of writing a framework to use ES as my sole
data store. Although I'm not using routing myself yet, the framework has
to support it.
The documentation sort of explains how to use it, but I feel like it's
missing some recommendations on why one should use it; under what
conditions is it a good idea to start using this feature and when
using it isn't such a good idea. It doesn't appear to really be
necessary to specify the routing parameter at all; presumably it is there
for a good reason.
The main use I can see is if you have a large dataset (i.e. you need lots
of shards) which is easily sub-divided.
For instance, we have a single application which runs multiple
white-label sites for many different clients. While we occasionally need
to search across all clients, the web app only ever needs to query one
client at a time. So instead of hitting 10 shards, we could just hit
one, if we use the client ID for routing.
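
A toy illustration of that fan-out difference, not how ES does it internally: the routing value is hashed to a shard number, so a routed search visits one shard while an unrouted one visits all of them. CRC32 is only a stand-in for the real hash function, and the client ID is made up.

```python
import zlib

NUM_SHARDS = 10  # matches the "10 shards" in the example above


def shard_for(routing, num_shards=NUM_SHARDS):
    """Map a routing value to a shard number (CRC32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def shards_to_search(routing=None, num_shards=NUM_SHARDS):
    """Shards a search has to visit, with or without a routing value."""
    if routing is None:
        return list(range(num_shards))       # cross-client search: every shard
    return [shard_for(routing, num_shards)]  # single-client search: one shard


print(len(shards_to_search()))             # 10
print(len(shards_to_search("client-42")))  # 1
```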
I suppose you could also say: use the user_id for routing for anything
that belongs to the user (even if there isn't an enforced parent-child
relationship).
This does introduce a complication though. A unique ID for a document
in a cluster actually consists of Index, Type, ID and Routing. It is
quite possible to have two docs with the same Index, Type and ID if you
specify different Routing values (although this would be unwise, as your
Routing value may end up being hashed to the same shard without you
realising it).
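
A small in-memory model of that complication, assuming a toy hash (sum of bytes modulo the shard count) in place of the real one. Each shard only knows about its own (type, id) keys, so the same ID can live twice under different routing values, or silently overwrite itself when two routing values happen to land on the same shard.

```python
def shard_for(routing, num_shards):
    """Toy stand-in for the real routing hash: sum of bytes modulo shard count."""
    return sum(routing.encode("utf-8")) % num_shards


class TinyIndex:
    """In-memory model of one index: each shard is a dict keyed by (type, id)."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def index(self, doc_type, doc_id, doc, routing=None):
        # Without an explicit routing value, the document ID is used.
        key = doc_id if routing is None else routing
        shard = self.shards[shard_for(key, len(self.shards))]
        shard[(doc_type, doc_id)] = doc  # same shard + same type/id overwrites

    def copies(self, doc_type, doc_id):
        """Every stored document with this type and ID, in shard order."""
        key = (doc_type, doc_id)
        return [shard[key] for shard in self.shards if key in shard]


idx = TinyIndex(num_shards=2)
idx.index("item", "1", {"v": "first"}, routing="a")   # "a" -> 97 % 2 = shard 1
idx.index("item", "1", {"v": "second"}, routing="b")  # "b" -> 98 % 2 = shard 0
print(len(idx.copies("item", "1")))  # 2: same type and ID, two live documents

idx.index("item", "1", {"v": "third"}, routing="c")   # "c" -> 99 % 2 = shard 1
print(idx.copies("item", "1"))  # [{'v': 'second'}, {'v': 'third'}]
```

The third write replaces the first even though the routing values differ, because "a" and "c" hash to the same shard in this toy setup.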
Does routing however create 'hot' shards that get hit more than
others? Does that matter anyway with replicas to distribute the load?
Potentially, yes. It is not obvious which shard your routing value would
point to. So (eg in my route-by-client-id example) we could end up with
our two biggest clients on the same shard, and our two smallest on the
same shard.
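
A made-up example of how that can play out. The client IDs and document counts are invented, and the hash is a toy (sum of bytes modulo the shard count) chosen so the result is easy to check by hand; with the real hash the outcome is just as possible, only harder to predict.

```python
def shard_for(routing, num_shards):
    """Toy stand-in for the real routing hash: sum of bytes modulo shard count."""
    return sum(routing.encode("utf-8")) % num_shards


def docs_per_shard(client_sizes, num_shards):
    """Total document count per shard when routing by client ID."""
    load = {shard: 0 for shard in range(num_shards)}
    for client_id, doc_count in client_sizes.items():
        load[shard_for(client_id, num_shards)] += doc_count
    return load


# Hypothetical client IDs and document counts, made up for illustration.
clients = {"101": 900_000, "102": 2_000, "103": 700_000, "104": 1_000}
print(docs_per_shard(clients, num_shards=2))
# {0: 1600000, 1: 3000}
```

Here the two biggest clients share shard 0 and the two smallest share shard 1, which is the lopsided case described above.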
I see from the recent yFrog post they use routing (see [1] if you
missed that, thanks for that great post!). As I understand that post, all
of a user's data is forced into the same shard, but it's not
clear to me what benefit that has in the search case.
I'd be very interested in hearing how people are using routing. How
'dynamic' is the routing value?
clint