I did find a way to prevent nodes from running rivers.
I set this to true on both my data nodes and false on my master node. So
far so good.
I also noticed that multiple rivers ran on different nodes. I'm not
certain if this was a side effect of that setting or a coincidence. I'll
be doing more testing in the next couple of weeks.
All your ideas have merit. There is definitely room to improve rivers.
Rivers that do not always have to run, and a reliable status, would be huge.
On Saturday, December 7, 2013 6:10:50 AM UTC-5, Jörg Prante wrote:
I think the concept of river is broken. For example:
it is assumed that river instances shall always run. If a node fails,
all the river instances on that node are started again on other nodes. The
idea is to run rivers continuously without interruption so no data gets lost.
the river cluster service does not watch what river instances are
currently doing and what the river instance state is since the river
instance state is private to the river.
if a river instance is deleted, the river cluster service must know this
instance is permanently removed. But what does "permanently" mean if you can
recreate a river instance under the same name after a deletion?
There were discussions that rivers may be deprecated in favor of message
queues like logstash.
I think it would be a good idea to improve the river concept to a truly
rivers should be aware of many river instances in parallel so they could
share the work by dividing the workload
a river instance should always be distributed to many nodes, and by
river instance creation, a plan of execution is announced to all river
river instances should (similar to web crawlers) receive a list of URLs
of sources they can process in parallel. The URLs carry schemes for custom
URL handlers (like twitter://, wikipedia://, jdbc:// etc.) Dispatching the
URLs would be a central task at river initiation phase, probably of the ES
master node, or the node that receives a river creation request. The state
of each (active) URL should be available in the cluster state
and, river instances should be identifiable by the cluster service by an
ID, and should respond with a state message if they are asked for a report.
Also, a river instance should be able to receive stop signals and react in
a predictable way (finishing the URL queue, finishing the current URL then
aborting the URL queue, or aborting immediately)
river instances should be able to shut down automatically if the list of
URLs they received is done and delete themselves from the active river
instance list in the cluster state
plan of execution could also be defined by a cron-like request
nodes should be configurable if they can run river instances or not
the number of river instances could also be a parameter in a river
creation request. So if the number of URLs to be processed exceeds the
available river nodes, they would have to be executed in a queue
by providing a standard bulk indexing procedure in a new generic river
framework common to all rivers, writing custom code for rivers would reduce
to the mere task of handling a single URL for fetching data and constructing
JSON documents in a stream-like manner, maybe with something like JSON-Path
keys for inserting values.
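
To make the URL-dispatching idea above a bit more concrete, here is a rough sketch in Python. It is not how rivers work today and not an Elasticsearch API; the names dispatch, RiverInstance, step, report and the placeholder handlers are all made up for illustration. It assumes the URLs are dealt round-robin to a fixed number of instances, so when there are more URLs than instances they simply queue up, and each instance can answer with a state message when asked.

```python
from collections import deque
from urllib.parse import urlsplit


def dispatch(urls, instance_count):
    """Deal source URLs round-robin into one queue per river instance."""
    queues = [deque() for _ in range(instance_count)]
    for index, url in enumerate(urls):
        queues[index % instance_count].append(url)
    return queues


class RiverInstance:
    """Hypothetical river instance: works through its URL queue and reports state."""

    def __init__(self, instance_id, queue, handlers):
        self.instance_id = instance_id
        self.queue = queue
        self.handlers = handlers  # maps URL scheme -> function(url) returning documents
        self.processed = []

    def step(self):
        """Process the next URL, if any, and return the documents it produced."""
        if not self.queue:
            return []
        url = self.queue.popleft()
        handler = self.handlers[urlsplit(url).scheme]
        documents = list(handler(url))
        self.processed.append(url)
        return documents

    def report(self):
        """State message that a cluster service could ask for by instance ID."""
        return {
            "id": self.instance_id,
            "state": "running" if self.queue else "finished",
            "pending": list(self.queue),
            "processed": list(self.processed),
        }


# Placeholder handlers (assumptions); real ones would fetch data and build JSON documents.
handlers = {
    "twitter": lambda url: [{"source": url, "kind": "tweet"}],
    "wikipedia": lambda url: [{"source": url, "kind": "page"}],
    "jdbc": lambda url: [{"source": url, "kind": "row"}],
}

urls = ["twitter://stream", "wikipedia://en", "jdbc://db/orders"]
instances = [
    RiverInstance(f"river-{n}", queue, handlers)
    for n, queue in enumerate(dispatch(urls, instance_count=2))
]
for instance in instances:
    while instance.queue:
        instance.step()
    print(instance.report())
```

With a layout like this, a custom river would only have to supply one handler per URL scheme; the queueing and the state report would live in the shared framework.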
So many wishes... sorry for that. But it's Christmas time.