Limit searches to specific shards


(jason-3) #1

Is it possible to create some sort of client (Transport client, Node
client or REST) that does either of the following?

  1. searches only the shards local to the server the client is
    connected to (for a given index/type/etc.)
    OR
  2. lets the client specify which shards the search is limited to (for a
    given index/type/etc.).

I don't think the "routing" feature supports limiting a search to a
specified shard number.
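
For context, routing selects a shard by hashing the routing value and taking the result modulo the number of primary shards, so the caller supplies a key rather than a shard number. A minimal Python sketch of that idea follows. `shard_for_routing` is a hypothetical helper, and `zlib.crc32` is only a stand-in for whatever hash Elasticsearch uses internally, so the shard numbers it produces will not match a real cluster.

```python
import zlib


def shard_for_routing(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number (illustrative hash only)."""
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    # Stand-in hash: NOT the hash Elasticsearch uses internally.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    # The caller controls the key, not the resulting shard number.
    for key in ("user-1", "user-2", "user-3"):
        print(key, "->", shard_for_routing(key, 5))
```

The same key always lands on the same shard, but picking shard 3 specifically would mean searching for a key that happens to hash there, which is why routing alone does not cover this use case.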

Here is my use case. We would like to use ES as both the index and
the data store. We need to be able to run MapReduce across all the data
in the data store periodically. Each node runs ES, a Hadoop DataNode,
and a Hadoop TaskTracker. I want to create a Hadoop InputFormat that is
truly data-local based on the shards. We evaluated wonderdog and it
works okay, but having truly data-local searches would be more
scalable.
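
To make the data-local idea concrete, here is a rough Python sketch of the split planning such an InputFormat would do: one split per shard, each tagged with the hosts that hold a copy, which is the role `InputSplit.getLocations()` plays in Hadoop. `ShardSplit`, `plan_splits`, the index name and the host names are all hypothetical, and a real implementation would be written in Java against the Hadoop API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ShardSplit:
    """One unit of work: a single shard plus the hosts that hold a copy."""
    index: str
    shard: int
    hosts: Tuple[str, ...]  # analogous to InputSplit.getLocations()


def plan_splits(index: str, shard_hosts: Dict[int, Iterable[str]]) -> List[ShardSplit]:
    """Return one split per shard, ordered by shard number."""
    return [
        ShardSplit(index, shard, tuple(sorted(hosts)))
        for shard, hosts in sorted(shard_hosts.items())
    ]


if __name__ == "__main__":
    # Hypothetical layout: shard number -> hosts holding a copy of it.
    layout = {1: ["host-b"], 0: ["host-b", "host-a"]}
    for split in plan_splits("logs", layout):
        print(split)
```

The scheduler can then try to place each map task on one of the split's hosts, but that only pays off if the search issued by the task can be restricted to that one shard.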

If the client could limit searches to specific shards, or to only
local shards, then I think this would be possible, since I can
determine which shards reside on which machines from the cluster state
(curl -s -XGET 'http://localhost:9200/_cluster/state', or the
equivalent Java code).
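
As an illustration of that lookup, the Python sketch below walks a cluster state document and groups the started shard copies by node. The field names (`routing_table`, `indices`, `shards`, `state`, `node`, `shard`) are my assumption about the shape of the `_cluster/state` response and should be checked against the version in use. `SAMPLE_STATE` is made-up data and `shards_by_node` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Made-up, heavily trimmed cluster state used only for illustration.
SAMPLE_STATE = {
    "routing_table": {
        "indices": {
            "logs": {
                "shards": {
                    "0": [
                        {"state": "STARTED", "primary": True, "node": "node-a", "shard": 0, "index": "logs"},
                        {"state": "STARTED", "primary": False, "node": "node-b", "shard": 0, "index": "logs"},
                    ],
                    "1": [
                        {"state": "STARTED", "primary": True, "node": "node-b", "shard": 1, "index": "logs"},
                        {"state": "UNASSIGNED", "primary": False, "node": None, "shard": 1, "index": "logs"},
                    ],
                }
            }
        }
    }
}


def shards_by_node(state: dict) -> Dict[str, List[Tuple[str, int]]]:
    """Group started shard copies as node id -> sorted [(index, shard), ...]."""
    result = defaultdict(list)
    indices = state.get("routing_table", {}).get("indices", {})
    for index_name, index_info in indices.items():
        for copies in index_info.get("shards", {}).values():
            for copy in copies:
                # Skip copies that are unassigned or still initializing.
                if copy.get("state") == "STARTED" and copy.get("node"):
                    result[copy["node"]].append((index_name, int(copy["shard"])))
    return {node: sorted(shards) for node, shards in result.items()}


if __name__ == "__main__":
    for node, shards in sorted(shards_by_node(SAMPLE_STATE).items()):
        print(node, shards)
```

In a real job the dictionary would come from parsing the JSON returned by the cluster state call, and node ids would still have to be translated to host names before they are useful as split locations.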

If this search-limiting functionality doesn't exist, would it be
appropriate for a plugin, or is this a more far-reaching change to
the internals?

Thanks,

--Jason


(Shay Banon) #2

Heya,

Got it. It should be simple to add. We already have a preference
parameter; we just need to find a good way to express what you are after
as a value. Open an issue?

-shay.banon

In reply to jason (jason.trost@gmail.com), Mon, Oct 10, 2011 at 3:25 PM.


(Shay Banon) #3

Ok, opened https://github.com/elasticsearch/elasticsearch/issues/1388 and
pushed the implementation to master. See if it works for you.
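
For anyone trying this from a script, here is a small Python sketch that builds a search URL restricted to given shards through the preference parameter. The `_shards:2,3` value syntax is taken from the Elasticsearch search preference documentation as I understand it; that it matches exactly what this issue added is an assumption. `search_url` is a hypothetical helper and `logs` is a placeholder index name.

```python
from typing import Iterable
from urllib.parse import urlencode


def search_url(base_url: str, index: str, shards: Iterable[int]) -> str:
    """Build a search URL limited to the given shard numbers."""
    shard_list = ",".join(str(s) for s in shards)
    if not shard_list:
        raise ValueError("at least one shard number is required")
    # Keep ':' and ',' readable instead of percent-encoding them.
    query = urlencode({"preference": f"_shards:{shard_list}"}, safe=":,")
    return f"{base_url.rstrip('/')}/{index}/_search?{query}"


if __name__ == "__main__":
    print(search_url("http://localhost:9200", "logs", [2, 3]))
```

A data-local map task would pass only the shard behind its own split and send the request to the Elasticsearch node running on the same machine.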

In reply to Shay Banon (kimchy@gmail.com), Wed, Oct 12, 2011 at 8:38 PM.


(jason-3) #4

Awesome. Thanks Shay. I will check this out tomorrow.

In reply to Shay Banon (kimchy@gmail.com), Wed, Oct 12, 2011 at 3:11 PM.

