I need to implement some document enhancing functionality and I am looking
for the best practices/examples/references about how to do it. Basically I
need the following:
1/ the code would be scheduled to start every minute or so
2/ it would pull data from one index, process it and insert (or update) it
into another index
3/ I need to make sure that at certain points there is only a single
processing unit for the whole cluster (parallel execution could lead to
inaccurate results)
Given the above points I think I can implement it as a river, especially
because of #3, but I also have some concerns about it:
if the code has bugs (for example memory leaks), what would be the
impact on the cluster if it runs as a river? And is there anything I can do
to minimize the risk that a crappy river hurts the ES cluster?
although a river can execute any general code, its original purpose is to
pull/push data from/to external sources. Isn't it a serious misuse to use a
river to get data from one index and index it into another index within the
same cluster?
as for the scheduling, I know it is possible to start a Java Timer inside the
river, but isn't there a built-in scheduling API in ES that I could use
instead?
if the code has bugs (for example memory leaks), what would be the
impact on the cluster if it runs as a river? And is there anything I can do
to minimize the risk that a crappy river hurts the ES cluster?
Not much. If it leaks memory, it will cause an OOM on that node.
although a river can execute any general code, its original purpose is to
pull/push data from/to external sources. Isn't it a serious misuse to use a
river to get data from one index and index it into another index within the
same cluster?
It can be done, and I don't think it's a misuse.
as for the scheduling, I know it is possible to start a Java Timer inside the
river, but isn't there a built-in scheduling API in ES that I could use
instead?
I suggest you use your own scheduler, that's fine. Just make sure to close it
when the river closes.
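
For what it's worth, here is a minimal sketch of the "use your own scheduler
and close it when the river closes" advice. It is plain Java and does not use
any ES classes: ScheduledEnhancer, its start()/close() methods and the period
value are made-up names for illustration, and the assumption is that you call
them from your river's own start and close hooks. A single-threaded executor
with a fixed delay keeps runs from overlapping inside one JVM; it does nothing
by itself about the single-unit-per-cluster requirement in #3.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: owns the scheduler so the river only has to
// forward its start/close lifecycle calls to it.
class ScheduledEnhancer {
    private final Runnable task;
    private final long periodMillis;
    private ScheduledExecutorService scheduler;

    ScheduledEnhancer(Runnable task, long periodMillis) {
        this.task = task;
        this.periodMillis = periodMillis;
    }

    // Assumed to be called when the river starts.
    synchronized void start() {
        if (scheduler != null) {
            return; // already running
        }
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs,
                // so report it and keep the schedule alive.
                System.err.println("enhancer run failed: " + e);
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    // Assumed to be called when the river closes.
    synchronized void close() {
        if (scheduler == null) {
            return;
        }
        scheduler.shutdownNow(); // interrupts a run that is in progress
        try {
            scheduler.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        scheduler = null;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        // The real task would read from one index and write to the other.
        ScheduledEnhancer enhancer =
                new ScheduledEnhancer(runs::incrementAndGet, 100);
        enhancer.start();
        Thread.sleep(500);
        enhancer.close();
        System.out.println("runs completed: " + runs.get());
    }
}
```

In a real river the period would be something like 60000 ms, and the task
itself should check for interruption so that close() does not have to wait
for a long run to finish.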