Just Pushed: API: Allow to control document shard routing, and search shard routing


(Shay Banon) #1

Hi,

Just pushed support for explicit document routing and search routing.
Here is the issue (content copied below):
http://github.com/elasticsearch/elasticsearch/issues/issue/470.

Currently, when indexing or deleting documents, they are hashed based on the
type and id and routed to a shard based on the hash result. When searching,
the search request is a broadcast request to all shards within an index.

Sometimes, it make sense to control this routing. For example, indexing blob
posts for a specific user might be routed based on the user id. If routing
is controlled, then when performing a search on posts for only that user,
only the relevant shard that match the user id can be queried, resulting in
much faster search and overhaul less load on the system.

The index and delete operation now allow for a routing parameter to be
specified. When specified, that routing value will be used to control the
shard placement.

The get operation allows to specify a routing parameter as well, to
specify which shard the doc will be fetched from. Note, in our blog post
partitioned by user example, doing a lookup just by the post id will not be
enough and probably will not find anything, the user id will also need to be
specified in the routing parameter.

The search and count operations accept a routing parameter as well,
controlling which shards the search will be executed on. The routing
parameter accepts a comma separated list of the routing values to use, and
the all relevant shards will execute the query.

Note, even when specifying a search for a specific user blog posts using the
routing parameter set to the user id, filtering only the user posts is
still needed by, for example, adding a term filter with the user id.

The delete_by_query operation also accepts a routing parameter, which is
a comma separated list of routing values of controlling which shards the
delete query will be executed on.

One question that might arise is why not use indices to get the same
behavior. For example, create an index per user. The reason is that indices
have a much lower limit on how many of them can be created. A single machine
can easily support millions of users with millions of posts. Creating an
index per user will mean millions of indices, which is problematic, as even
with a single shard per index, it does mean millions of lucene indices.

-shay.banon


(hartzler-2) #2

Fantastic! I will definitely look into this.

I have been able to make good progress on a LRU index manager plugin for
dealing with the "millions of lucene indexes" problem. It uses the
open/close api you added earlier and works by extending the NodeClient
implementation to manage the current open set of indexes. Paired with a
reliable/available shared gateway config (I'm working on a cassandra gateway
plugin as well, the s3 gateway would work great if on ec2) this may provide
a workable approach to index per user strategy. Obviously this strategy
will require you to have a small percentage of your users active over a
reasonable time period (like 1-10% in any given hour).

However, being able to route to shards on a per user basis may well work
better for this scenario as well, testing will tell.

On Tue, Nov 2, 2010 at 1:00 PM, Shay Banon shay.banon@elasticsearch.comwrote:

Hi,

Just pushed support for explicit document routing and search routing.
Here is the issue (content copied below):
http://github.com/elasticsearch/elasticsearch/issues/issue/470.

Currently, when indexing or deleting documents, they are hashed based on
the type and id and routed to a shard based on the hash result. When
searching, the search request is a broadcast request to all shards within an
index.

Sometimes, it make sense to control this routing. For example, indexing
blob posts for a specific user might be routed based on the user id. If
routing is controlled, then when performing a search on posts for only that
user, only the relevant shard that match the user id can be queried,
resulting in much faster search and overhaul less load on the system.

The index and delete operation now allow for a routing parameter to
be specified. When specified, that routing value will be used to control the
shard placement.

The get operation allows to specify a routing parameter as well, to
specify which shard the doc will be fetched from. Note, in our blog post
partitioned by user example, doing a lookup just by the post id will not be
enough and probably will not find anything, the user id will also need to be
specified in the routing parameter.

The search and count operations accept a routing parameter as well,
controlling which shards the search will be executed on. The routing
parameter accepts a comma separated list of the routing values to use, and
the all relevant shards will execute the query.

Note, even when specifying a search for a specific user blog posts using
the routing parameter set to the user id, filtering only the user posts is
still needed by, for example, adding a term filter with the user id.

The delete_by_query operation also accepts a routing parameter, which
is a comma separated list of routing values of controlling which shards the
delete query will be executed on.

One question that might arise is why not use indices to get the same
behavior. For example, create an index per user. The reason is that indices
have a much lower limit on how many of them can be created. A single machine
can easily support millions of users with millions of posts. Creating an
index per user will mean millions of indices, which is problematic, as even
with a single shard per index, it does mean millions of lucene indices.

-shay.banon


(system) #3