When/Why to use Routing for indexing/searching

I'm intrigued by the routing property that can be used during
indexing/searching. The documentation sort of explains how to use it, but
I feel like it's missing some recommendations on why one should use it;
under what conditions is it a good idea to start using this feature and
when using it isn't such a good idea. Specifying the routing parameter
doesn't appear to be strictly necessary, but presumably it's there for a
good reason.

Obviously parent/child requires the child to be hosted on the same shard as
the parent, so that makes sense, but I get the sense there's also a search
optimization here: avoiding shards that don't have any results for that
parameter looks like another good use case for routing.

For our use case, I'm wondering whether we should route items by their
'projectid'. In our domain the project is the central 'root' concern of all
data elements; pretty much everything belongs to a project. So when indexing,
should I use a routing value based on the projectID so that all
project-related information is co-located on the same shard?
Generally people search on a project basis, but sometimes they want to
search across multiple projects, so we'd need to be able to spread that
search across projects.

Does routing, however, create 'hot' shards that get hit more than others?
Does that matter anyway, with replicas to distribute the load? I see from
the recent yFrog post that they use routing (see [1] if you missed it, thanks
for that great post!). All of a user's data is forced into the same shard, as
I understand from reading that post, but it's not clear to me what benefit
that has in the search case.

Can anyone else explain why they use routing for their domain?

thanks!

Paul Smith
[1] yFrog -
http://elasticsearch-users.115913.n3.nabble.com/some-ES-stats-at-yfrog-com-td3759891.html

Hiya Paul

On Thu, 2012-03-01 at 16:51 +1100, Paul Smith wrote:

I'm intrigued by the routing property that can be used during
indexing/searching.

Me too. I'm in the process of writing a framework to use ES as my sole
data store. Although I'm not using routing myself yet, the framework has
to support it.

The documentation sort of explains how to use it, but I feel like it's
missing some recommendations on why one should use it; under what
conditions is it a good idea to start using this feature and when
using it isn't such a good idea. Specifying the routing parameter
doesn't appear to be strictly necessary, but presumably it's there for a
good reason.

The main use I can see is if you have a large dataset (ie you need lots
of shards) which is easily sub-divided.

For instance, we have a single application which runs multiple
white-label sites for many different clients. While we occasionally need
to search across all clients, the web app only ever needs to query one
client at a time. So instead of hitting 10 shards, we could just hit
one, if we use the client ID for routing.
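As a minimal sketch of what "use the client ID for routing" looks like on the wire: Elasticsearch accepts a `routing` query parameter on both index and search requests. The host, the `sites` index name, the document ID and the client ID below are all made up for illustration, and the `_doc` path segment is the newer typeless form (releases from the time of this thread used `/{index}/{type}/{id}` instead).

```python
from urllib.parse import quote, urlencode


def index_url(base, index, doc_id, routing):
    """URL for indexing one document with an explicit routing value."""
    path = f"{base}/{quote(index)}/_doc/{quote(str(doc_id))}"
    return path + "?" + urlencode({"routing": routing})


def search_url(base, index, routing=None):
    """URL for a search; without routing the request goes to every shard."""
    url = f"{base}/{quote(index)}/_search"
    if routing is not None:
        # With routing, only the shard that the value hashes to is queried.
        url += "?" + urlencode({"routing": routing})
    return url


base = "http://localhost:9200"  # hypothetical local node
print(index_url(base, "sites", "page-1", "client-42"))
print(search_url(base, "sites", "client-42"))  # one client, one shard
print(search_url(base, "sites"))               # all clients, all shards
```

The same routing value has to be supplied at index time and at search time, otherwise the search may be sent to a shard that does not hold the client's documents.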

I suppose you could also say: use the user_id for routing for anything
that belongs to the user (even if there isn't an enforced parent-child
relationship).

This does introduce a complication though. A unique ID for a document
in a cluster actually consists of Index, Type, ID and Routing. It is
quite possible to have two docs with the same Index, Type and ID if you
specify different Routing values (although this would be unwise, as your
Routing value may end up being hashed to the same shard without you
realising it).
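The complication can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyIndex`, `shard_for` and `find_routings` are hypothetical names, the shard count of 5 is arbitrary, and CRC32 stands in for whatever hash Elasticsearch really uses. The only property relied on is that the shard is chosen as hash(routing) modulo the number of shards, and that each shard tracks documents by ID on its own.

```python
import zlib

NUM_SHARDS = 5  # assumed shard count for this toy index


def shard_for(routing, num_shards=NUM_SHARDS):
    # Stand-in hash: CRC32 is NOT the hash Elasticsearch uses. Only the
    # "hash the routing value, then modulo the shard count" shape matters.
    return zlib.crc32(routing.encode("utf-8")) % num_shards


class ToyIndex:
    """Each shard is a dict keyed by document ID only."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.shards = [dict() for _ in range(num_shards)]

    def index(self, doc_id, body, routing=None):
        # With no explicit routing, the document ID acts as the routing value.
        key = routing if routing is not None else doc_id
        self.shards[shard_for(key, len(self.shards))][doc_id] = body

    def copies(self, doc_id):
        return sum(1 for shard in self.shards if doc_id in shard)


def find_routings(num_shards=NUM_SHARDS):
    """Return (a, b, c): a and b share a shard, c lands on a different one."""
    names = [f"client-{i}" for i in range(100)]
    a = names[0]
    home = shard_for(a, num_shards)
    b = next(n for n in names[1:] if shard_for(n, num_shards) == home)
    c = next(n for n in names[1:] if shard_for(n, num_shards) != home)
    return a, b, c


a, b, c = find_routings()

different = ToyIndex()
different.index("doc-1", {"v": 1}, routing=a)
different.index("doc-1", {"v": 2}, routing=c)  # other shard: a second copy
print(different.copies("doc-1"))  # 2

same = ToyIndex()
same.index("doc-1", {"v": 1}, routing=a)
same.index("doc-1", {"v": 2}, routing=b)  # same shard: silently overwritten
print(same.copies("doc-1"))  # 1
```

Whether a second routing value produces a duplicate or an overwrite depends purely on where it hashes, which is why reusing an ID under different routing values is unwise.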

Does routing however create 'hot' shards that get hit more than
others? Does that matter anyway with replicas to distribute the load?

Potentially, yes. It is not obvious which shard your routing value would
point to. So (eg in my route-by-client-id example) we could end up with
our two biggest clients on the same shard, and our two smallest on the
same shard.
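A rough way to see how uneven a route-by-client-id layout could get is to hash a list of tenants and add up their sizes per shard. Everything here is an assumption for illustration: the client names and document counts are invented, and CRC32 is a stand-in for the real routing hash, so the actual placement in a cluster will differ.

```python
import zlib


def shard_for(routing, num_shards):
    # Stand-in hash, not Elasticsearch's own; same "hash modulo shards" shape.
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def docs_per_shard(tenant_sizes, num_shards):
    """Sum each tenant's document count onto the shard its ID hashes to."""
    totals = [0] * num_shards
    for tenant, size in tenant_sizes.items():
        totals[shard_for(tenant, num_shards)] += size
    return totals


# Invented sizes: two large clients and a handful of small ones.
tenant_sizes = {
    "client-a": 9_000_000,
    "client-b": 7_500_000,
    "client-c": 40_000,
    "client-d": 25_000,
    "client-e": 10_000,
    "client-f": 5_000,
}

for shard, total in enumerate(docs_per_shard(tenant_sizes, num_shards=4)):
    print(f"shard {shard}: {total:>10,} docs")
```

Running the same calculation against real client IDs and sizes before choosing a routing key gives an early warning if the biggest tenants would collide.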

I see from the recent yFrog post that they use routing (see [1] if you
missed it, thanks for that great post!). All of a user's data is forced
into the same shard, as I understand from reading that post, but it's not
clear to me what benefit that has in the search case.

I'd be very interested in hearing how people are using routing. How
'dynamic' is the routing value?

clint

Heya, yea, what you mention, having a projectId as the routing, can be a nice optimization, since then when you search on a project, you can run the search on a single shard instead of broadcasting it across all shards. This allows you to have a considerably higher number of shards on the "products" index. I go into detail about it here: https://groups.google.com/forum/?fromgroups#!searchin/elasticsearch/data$20flow/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ.

Shay,

Will a single value for a routing id (say in this case "project123") always
resolve to a single shard, or will ES manage multiple shards per routing
value if we exceed a size threshold?

--Mike

We do have a scenario where we use routing for a production system in my
company (the main ecommerce site for LATAM).

The site manages millions of users selling millions of items, and each
user needs to manage their own items in a 'My Account' site. So, for this
system, we created an ES index for all items and search them using the
seller ID as a filter.

At indexing time we route documents using the seller ID, so that items from
the same seller go to the same shard/s and thus we balance the required
space among servers (we can consider the distribution of items per user
normal). For improving search times, as Shay said, we route user searches
using the same seller ID so that they hit the right servers and not all of
them.
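A search along these lines might be built as in the sketch below. The `items` index and the `title` and `seller_id` field names are hypothetical, and the `bool`/`filter` query shape is the newer Elasticsearch query DSL rather than whatever this site actually ran. The point it illustrates is that routing only selects the shard; because that shard also holds other sellers' items, the seller ID filter is still needed.

```python
import json
from urllib.parse import urlencode


def seller_search(seller_id, text):
    """Return (path, body) for a search limited to one seller's items."""
    # Routing narrows the request to the shard this seller hashes to.
    path = "/items/_search?" + urlencode({"routing": seller_id})
    body = {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                # The shard is shared with other sellers, so filter as well.
                "filter": [{"term": {"seller_id": seller_id}}],
            }
        }
    }
    return path, json.dumps(body)


path, body = seller_search(12345, "bicycle")
print(path)
print(body)
```

Dropping the routing parameter from the same request gives the cross-seller variant, at the cost of querying every shard.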

So, although some sellers of course have far more items than the rest, and
that may slightly overload some servers, the number of users is big enough
to keep the load well distributed with this scheme. On the other
hand, a possible alternative would be to have one index per
user, but we weren't sure how ES would behave with millions of indices, and
we preferred to have all items under the same "umbrella", as it simplifies
administration tasks when we need to search for items across all users.

Hope it helps! Cheers,
Frederic

It will always resolve to a single shard.

That's a great link Shay, thanks very much. Hitting a single shard and not
needing to merge the per-shard result sets is a nice local optimization,
particularly when the size of the result set may be large (the pathological
case of needing 'all' results sorted is exacerbated when split across
shards, as I understand it).

I was thinking about our project distribution: some projects are huge, many
tens of millions of items, whereas some are quite small. I wondered
about the problems of 'hotness' of a specific shard, but I think it ends up
being better when a search only queries a single shard, and the hotness is
good for the filesystem cache anyway. Replicas are then the way to
distribute the read load of searches.

thanks again,

Paul

Usually, for people who use the routing feature and are concerned about the "hotness" of a specific user/project: if a specific project/user becomes really big, you can always move it to its own index. Aliases allow you to do this without affecting the client code. Instead of having an alias with a routing value and filter pointing to your multi-tenant index, you repoint the alias to an index that is dedicated to that large project.
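The alias swap described here could look roughly like the following request bodies for the `_aliases` endpoint, which takes a list of `add` and `remove` actions. The alias, index and field names (`project-123`, `projects-shared`, `project-123-dedicated`, `project_id`) are invented for the example, and the sketch assumes the project's documents have already been copied into the dedicated index before the alias is switched.

```python
import json


def shared_alias(alias, shared_index, project_id):
    """Alias into the multi-tenant index: fixed routing plus a filter."""
    return {"actions": [{"add": {
        "index": shared_index,
        "alias": alias,
        "routing": project_id,
        "filter": {"term": {"project_id": project_id}},
    }}]}


def move_to_own_index(alias, shared_index, dedicated_index):
    """Repoint the alias; both actions are sent in a single request."""
    return {"actions": [
        {"remove": {"index": shared_index, "alias": alias}},
        {"add": {"index": dedicated_index, "alias": alias}},
    ]}


# Body for POST /_aliases while the project lives in the shared index.
print(json.dumps(shared_alias("project-123", "projects-shared", "123"), indent=2))

# Body for POST /_aliases once the project has outgrown the shared index.
print(json.dumps(
    move_to_own_index("project-123", "projects-shared", "project-123-dedicated"),
    indent=2,
))
```

Client code keeps searching and indexing against `project-123` throughout; only the alias definition changes.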

Shay indicated here that a routing value will always resolve to a single shard.

So, is ES maintaining an internal "index" (by "index", I mean some way of identifying the specific shard for a specific routing id)?

ES would need to somehow map a specific routing id to a specific shard, I would presume.

If so, is there an overhead to maintaining such a relationship (in terms of memory used and insertion time)? For example, how does the overhead of processing an incoming document without a routing value compare with that of a document that needs a specific routing value?

"http://elasticsearch-users.115913.n3.nabble.com/When-Why-to-use-Routing-for-indexing-searching-td3789570.html#a3790713"

Thanks in advance for responses,
Elan.