How to leverage ES for load balanced processing of data in an ES index


(Mauri) #1

I have a situation where I need to perform consolidation, analysis and
reporting tasks on data indexed in ES.
I would like to leverage ES to load balance or distribute these
processing tasks across a group of servers.
I am using the java API.

I have looked at the possibility of using native scripts and my
understanding is that native scripts are executed within the context
of queries whereby query requests are sent to all shards in an index
and each shard executes the native script locally for each search hit
and adds the output of the native script to the search hit. Not quite
what I am looking for.

The behaviour I am looking for is more like index and get requests
which are routed to a shard based on the index routing rules and
document key values (index/type/id).
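That kind of key-based routing can be illustrated with a small standalone sketch (the class and hash below are hypothetical stand-ins for demonstration; Elasticsearch's actual routing hashes the _id or routing value modulo the shard count with its own hash function):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of key-based routing: a document key is hashed to
// pick one node, so every request for the same key lands on the same node.
// The hash used here is just Java's String.hashCode, not ES's routing hash.
public class KeyRouter {
    private final List<String> nodes;

    public KeyRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    public String nodeFor(String documentKey) {
        // Mask off the sign bit so the modulo result is always a valid index.
        int h = documentKey.hashCode() & 0x7fffffff;
        return nodes.get(h % nodes.size());
    }

    public static void main(String[] args) {
        KeyRouter router = new KeyRouter(Arrays.asList("node-a", "node-b", "node-c"));
        // The same key always routes to the same node.
        System.out.println(router.nodeFor("order/42").equals(router.nodeFor("order/42")));
    }
}
```

Because the mapping from key to node is stable, each node could keep per-key working state in memory or on disk, which is the property the question is after.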

Is there a mechanism for executing native scripts for index/put/get
requests whereby I could send a request to ES that gets routed to
a cluster node based on a document key, and then have that node execute
a native script and return a result document generated by that script?
The native script may also create new documents which are added to the
index for later retrieval. This key-based routing would also allow
each cluster node to maintain working data in memory or on local disk
for the tasks it performs.

If this is not already supported, does the ES architecture make it
possible for me to write a plugin that adds custom actions and action
handlers that are compatible with the ES transports and index/get
request routing? If this is possible, can you direct me to Java source
code examples of this and/or give me an overview of the ES classes I
would have to subclass and how to register these via a plugin?

Regards
Mauri


(Shay Banon) #2

No, there isn't a way to do it. Why not do it outside of ES?

On Monday, March 5, 2012 at 9:09 AM, Mauri wrote:



(Mauri) #3

Hi Shay

I am hoping to leverage ES request routing rather than implement
something myself or bring in another framework to handle it.
I have been looking at the code for the attachment type, and it looks
feasible to use a custom type as follows:

  • Create a custom field type 'task' and register a 'TaskMapper' class
    using a plugin, in the same way that AttachmentMapper is registered for
    the attachment type.
  • Use the index API to index a document containing one or more 'task'
    fields that contain the parameters for the tasks to be performed.
  • The index request is routed by ES to a shard based on the key (index/
    type/id) assigned to the document.
  • When the document is indexed, the parse method of the TaskMapper is
    called for each task field, and TaskMapper processes the task similarly
    to how Tika is called by AttachmentMapper.
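The pattern in the steps above can be sketched in isolation (FieldHook and the method shape here are hypothetical stand-ins, not real Elasticsearch classes; the actual AttachmentMapper extends ES's Mapper hierarchy):

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of a mapper-like hook invoked per field at index time.
// "FieldHook" is an invented interface for illustration only.
interface FieldHook {
    // Called when the field is encountered during document parsing;
    // returns derived values to store alongside the document.
    Map<String, String> parse(String fieldValue);
}

class TaskMapper implements FieldHook {
    @Override
    public Map<String, String> parse(String taskParams) {
        Map<String, String> result = new HashMap<>();
        // Stand-in for running the task described by the field's parameters
        // (AttachmentMapper does the analogous step by calling Tika here).
        result.put("task.result", "processed:" + taskParams);
        return result;
    }
}

public class IndexTimeHookDemo {
    public static void main(String[] args) {
        FieldHook hook = new TaskMapper();
        Map<String, String> out = hook.parse("consolidate:2012-03");
        System.out.println(out.get("task.result")); // processed:consolidate:2012-03
    }
}
```

The key property of the pattern is that the hook runs on whichever shard the routed index request lands on, which is what makes the task execution follow the document key.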

I am not sure yet how the results generated by the task could be brought
out; possibly by passing them on to Lucene in the way Tika output is
handled by AttachmentMapper, only to a field with store=true, so that
the full result is stored in Lucene rather than just being indexed.

Regards
Mauri


(Shay Banon) #4

I thought you said you wanted to generate several documents for a "single json"; that's why I said it makes sense to do it on the client side, interacting with Elasticsearch.

On Tuesday, March 6, 2012 at 1:16 AM, Mauri wrote:
