Processing data in ES sequentially: what would be the best approach?

Hi,

I need to implement some document-enhancing functionality and I am looking
for best practices/examples/references on how to do it. Basically I
need the following:
1/ the code would be scheduled to start every minute or so
2/ it would pull data from one index, process it and insert (or update)
it into another index
3/ I need to make sure that at certain points there is only a single
processing unit for the whole cluster (parallel execution could lead to
inaccurate results)

Given the above points I think I can implement it as a river, especially
because of #3, but I also have some concerns about it:

  • if the code has bugs (for example memory leaks), what would be the
    impact on the cluster if it runs as a river? And is there anything I can
    do to minimize the risk that a crappy river hurts the ES cluster?
  • although a river can execute any general code, its original purpose is
    to pull/push data from external sources. Isn't it a serious misuse to use
    a river to get data from one index and index it into another index within
    the same cluster?
  • as for the scheduling, I know it is possible to start a Java Timer inside
    the river, but isn't there a built-in scheduling API in ES that I could
    use instead?

Regards,
Lukas

On Wed, Jan 11, 2012 at 4:55 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

  • if the code has bugs (for example memory leaks), what would be the
    impact on the cluster if it runs as a river? And is there anything I can
    do to minimize the risk that a crappy river hurts the ES cluster?

Not much. A river runs inside a node's JVM, so if it leaks memory it will
eventually cause an OutOfMemoryError (OOM) on that node.

  • although a river can execute any general code, its original purpose is
    to pull/push data from external sources. Isn't it a serious misuse to use
    a river to get data from one index and index it into another index within
    the same cluster?

It can be done; I don't think it's a misuse.
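
To illustrate the index-to-index step, here is a rough sketch of one
processing pass. It is not ES code: plain in-memory maps stand in for the
source and target indices, and the names EnhancePass and run are made up for
this example. In a real river the loop would presumably read with search
requests and write with index requests through the client the river is given.
The one point it tries to show is that reusing the source document ID in the
target makes a repeated pass overwrite (update) a document instead of
duplicating it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: one enhancement pass from a source "index" to a target "index". */
final class EnhancePass {

    /**
     * Copies every document from source to target after applying enhance.
     * Both maps are in-memory stand-ins for indices, keyed by document ID.
     * Returns the number of documents written.
     */
    static int run(Map<String, Map<String, Object>> source,
                   Map<String, Map<String, Object>> target,
                   UnaryOperator<Map<String, Object>> enhance) {
        int written = 0;
        for (Map.Entry<String, Map<String, Object>> entry : source.entrySet()) {
            // Work on a copy so the source document is never modified.
            Map<String, Object> copy = new LinkedHashMap<>(entry.getValue());
            // Same ID in the target: a second pass replaces, it does not duplicate.
            target.put(entry.getKey(), enhance.apply(copy));
            written++;
        }
        return written;
    }
}
```

Because the write is keyed by the source ID, running the pass again after a
failure or restart should be harmless in this sketch.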

  • as for the scheduling, I know it is possible to start a Java Timer inside
    the river, but isn't there a built-in scheduling API in ES that I could
    use instead?

I suggest you use your own scheduler; that's fine. Just make sure to shut it
down when the river closes.
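
A minimal sketch of what that could look like, using only the JDK. The class
name PeriodicTask and its methods are made up for this example and are not
part of ES; the idea is that the river would call start() when it starts and
close() when it closes. It uses a single-threaded ScheduledExecutorService
with a fixed delay, so within this one JVM the next run begins only after the
previous one has finished.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs one task repeatedly on a single background thread. */
final class PeriodicTask {
    private final Runnable task;
    private final long delay;
    private final TimeUnit unit;
    private ScheduledExecutorService executor;

    PeriodicTask(Runnable task, long delay, TimeUnit unit) {
        this.task = task;
        this.delay = delay;
        this.unit = unit;
    }

    /** Intended to be called when the river starts. */
    synchronized void start() {
        if (executor != null) {
            return; // already started
        }
        executor = Executors.newSingleThreadScheduledExecutor();
        // Fixed delay: the pause is measured from the end of the previous run,
        // so two runs never overlap on this executor.
        executor.scheduleWithFixedDelay(this::runSafely, 0, delay, unit);
    }

    private void runSafely() {
        try {
            task.run();
        } catch (Exception e) {
            // An exception escaping the task would cancel all future runs,
            // so it is reported here and the schedule continues.
            System.err.println("Scheduled run failed: " + e);
        }
    }

    /** Intended to be called when the river closes. */
    synchronized void close() throws InterruptedException {
        if (executor == null) {
            return; // never started or already closed
        }
        executor.shutdownNow(); // interrupts a run in progress
        executor.awaitTermination(10, TimeUnit.SECONDS);
        executor = null;
    }

    synchronized boolean isRunning() {
        return executor != null;
    }
}
```

For the "every minute or so" case this would be created as
new PeriodicTask(work, 1, TimeUnit.MINUTES). Note that catching Exception
does not protect against a memory leak; an OutOfMemoryError is an Error and
is deliberately not swallowed here.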
