Processing data in ES (in sequential approach). What would be the best approach?

Lukas_Vlcek1 · January 11, 2012, 2:55pm

Hi,

I need to implement some document-enhancing functionality and I am looking
for best practices/examples/references on how to do it. Basically I
need the following:
1/ the code would be scheduled to start every minute or so
2/ it would pull data from one index, process it and insert (or update)
it into another index
3/ I need to make sure that at certain points there is only a single
processing unit for the whole cluster (parallel execution could lead to
inaccurate results)

Given the above points I think I can implement it as a river, especially
because of #3, but I also have some concerns about it:

- If the code has bugs (for example memory leaks), what would be the
  impact on the cluster if it runs as a river? And is there anything I can
  do to minimize the risk that a crappy river hurts the ES cluster?
- Although a river can execute any general code, its original purpose is to
  pull/push data from external sources. Isn't it a serious misuse to use a
  river to get data from one index and index it into another index within
  the same cluster?
- As for the scheduling, I know it is possible to start a Java Timer inside
  the river, but isn't there a built-in scheduling API in ES that I could
  use instead?

Regards,
Lukas

kimchy · January 12, 2012, 10:29am

On Wed, Jan 11, 2012 at 4:55 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

> Hi,
>
> I need to implement some document-enhancing functionality and I am looking
> for best practices/examples/references on how to do it. Basically I
> need the following:
> 1/ the code would be scheduled to start every minute or so
> 2/ it would pull data from one index, process it and insert (or update)
> it into another index
> 3/ I need to make sure that at certain points there is only a single
> processing unit for the whole cluster (parallel execution could lead to
> inaccurate results)
>
> Given the above points I think I can implement it as a river, especially
> because of #3, but I also have some concerns about it:
>
> If the code has bugs (for example memory leaks), what would be the
> impact on the cluster if it runs as a river? And is there anything I can
> do to minimize the risk that a crappy river hurts the ES cluster?

Not much. If it leaks memory, it will cause an OOM on that node.

> Although a river can execute any general code, its original purpose is to
> pull/push data from external sources. Isn't it a serious misuse to use a
> river to get data from one index and index it into another index within
> the same cluster?

It can be done; I don't think it's a misuse.

> As for the scheduling, I know it is possible to start a Java Timer inside
> the river, but isn't there a built-in scheduling API in ES that I could
> use instead?

I suggest you use your own; that's fine. Just make sure to close it when the
river closes.
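
A minimal sketch of the "use your own scheduler and close it with the river" advice, assuming plain JDK scheduling rather than any ES API. The class name PeriodicTask and its methods are hypothetical and only mirror a start/close lifecycle; the Runnable stands in for the pull, process and index step. Note that this only keeps runs from overlapping inside one JVM; the single-instance-per-cluster requirement in #3 would still have to come from running it as a river.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of Elasticsearch): runs one task repeatedly
// on a single thread and stops it when close() is called.
class PeriodicTask {
    private final Runnable task;
    private final long delay;
    private final TimeUnit unit;
    private ScheduledExecutorService scheduler;

    PeriodicTask(Runnable task, long delay, TimeUnit unit) {
        this.task = task;
        this.delay = delay;
        this.unit = unit;
    }

    // Intended to be called when the owning component (e.g. a river) starts.
    synchronized void start() {
        if (scheduler != null) {
            return; // already running
        }
        scheduler = Executors.newSingleThreadScheduledExecutor();
        // Fixed delay: the next run is scheduled only after the previous one
        // has finished, so runs never overlap on this node.
        scheduler.scheduleWithFixedDelay(this::runOnce, 0, delay, unit);
    }

    private void runOnce() {
        try {
            task.run();
        } catch (RuntimeException e) {
            // An exception escaping a scheduled task cancels all later runs,
            // so report it and keep the schedule alive.
            System.err.println("processing run failed: " + e);
        }
    }

    // Intended to be called when the owning component closes.
    synchronized void close() throws InterruptedException {
        if (scheduler == null) {
            return; // never started or already closed
        }
        scheduler.shutdownNow(); // interrupts a run in progress
        scheduler.awaitTermination(10, TimeUnit.SECONDS);
        scheduler = null;
    }
}
```

With something like this, the river's own start method would call start() on the helper and its close method would call close(), so no scheduler thread outlives the river. A one-minute schedule would be new PeriodicTask(work, 1, TimeUnit.MINUTES), where work is your processing code.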

