I'm having an issue with JDBC rivers. I'm running ES 0.90.7 and JDBC river
2.2.3. I have a 3 node cluster: 1 w/no data (Windows) and 2 w/ data
(Linux).
This works for the initial load. But if I stop the river while it's running by
deleting it, I cannot start another river with the same parameters until I
stop ES on the data node that was processing that river. Once I do that,
another node starts the river. Is there something I'm not understanding?
I don't see any errors in the logs.
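For context, I create and delete the river roughly like this (the JDBC connection details below are placeholders, not my real config):

```shell
# Register a JDBC river (placeholder connection details)
curl -XPUT 'localhost:9200/_river/my_river_1/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "url" : "jdbc:mysql://dbhost:3306/mydb",
    "user" : "es",
    "password" : "secret",
    "sql" : "select * from orders"
  }
}'

# Delete it mid-run
curl -XDELETE 'localhost:9200/_river/my_river_1'
```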
The JDBC river runs in a separate thread and writes state info at each
cycle into the private river index. Deleting a river while it is running may
not remove the river's resources completely, or it may hang while doing so.
This could explain why the cluster thinks it should restart the river
instance on the other node. I have not tested these situations, though.
You may be right. I'm far from an expert in ES. If I delete the first
river (my_river_1) while it's running (the delete is successful) and then
create a new river (my_river_2) with the same parameters, the second river
won't start until I stop the ES node that was processing the first river.
I've tried waiting a few minutes in between, but it doesn't seem to matter.
Justin
On Friday, December 6, 2013 2:19:09 PM UTC-5, Jörg Prante wrote:
Not sure if this is related to rivers in general.
I did some more digging and you're right about it being a river issue (as
far as I can tell). It looks like the rivers all get assigned to a single
node. If I delete a river midway, that node won't run any additional rivers
that are created. But once that node is shut down, the other rivers I created
after the delete get assigned to another node and begin to process.
Rivers are singletons within the cluster. They get allocated
automatically to one of the nodes and run. If that node fails, a river will
be automatically allocated to another node.
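Since _river is an ordinary index, you can see which node a river instance was allocated to by reading the status document the cluster writes for it (field names may differ between versions):

```shell
# The "node" object in this document identifies the node
# currently running the river instance.
curl -XGET 'localhost:9200/_river/my_river_1/_status?pretty'
```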
I can't tell whether this is a bug or not, though. Or maybe there's some
amount of time I need to wait?
Justin
On Friday, December 6, 2013 3:21:51 PM UTC-5, Justin Doles wrote:
I think the concept of rivers is broken. For example:
- It is assumed that river instances shall always run. If a node fails, all
the river instances on that node are started again on other nodes. The idea
is to run rivers continuously without interruption so no data gets lost.
- The river cluster service does not watch what river instances are
currently doing or what the river instance state is, since the river
instance state is private to the river.
- If a river instance is deleted, the river cluster service must know this
instance is permanently removed. But what does "permanently" mean if you can
recreate a river instance under the same name after a deletion?
There were discussions that rivers may be deprecated in favor of message
queues like Logstash.
I think it would be a good idea to improve the river concept into a truly
distributed design. For example:
- Rivers should be aware of many river instances running in parallel so
they could share the work by dividing the workload.
- A river instance should always be distributed to many nodes, and at river
instance creation, a plan of execution is announced to all river instances.
- River instances should (similar to web crawlers) receive a list of URLs
of sources they can process in parallel. The URLs carry schemes for custom
URL handlers (like twitter://, wikipedia://, jdbc://, etc.). Dispatching the
URLs would be a central task in the river initiation phase, probably of the
ES master node, or the node that receives a river creation request. The
state of each (active) URL should be available in the cluster state.
- River instances should be identifiable by the cluster service by an ID,
and should respond with a state message when asked for a report. Also, a
river instance should be able to receive stop signals and react in a
predictable way (finishing the URL queue, finishing the current URL and then
aborting the URL queue, or aborting immediately).
- River instances should be able to shut down automatically once the list
of URLs they received is done, and delete themselves from the active river
instance list in the cluster state.
- The plan of execution could also be defined by a cron-like request.
- Nodes should be configurable as to whether they can run river instances
or not.
- The number of river instances could also be a parameter in a river
creation request. So if the number of URLs to be processed exceeds the
available river nodes, they would have to be executed in a queue.
- By providing a standard bulk indexing procedure in a new generic river
framework common to all rivers, writing custom code for rivers would reduce
to the mere task of handling a single URL for fetching data and constructing
JSON documents in a stream-like manner, maybe with something like JSON-Path
keys for inserting values.
So many wishes... sorry for that. But it's Christmas time.
Given that rivers in general seem to be flawed, is there an easy way
clients can tell whether a JDBC river job is running (so they don't try to
delete it)? Maybe a field in the internal JDBC river document? I haven't
seen any documentation on the structure of that doc.
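As a workaround, one thing that seems possible is simply dumping every document the river keeps in the _river index and inspecting whatever state fields the JDBC river writes there (the field names are version-dependent, and I haven't found them documented):

```shell
# List all documents stored for this river (meta, status, and any
# custom state the JDBC river writes); comparing its timestamps
# may hint at whether a run is currently active.
curl -XGET 'localhost:9200/_river/my_river_1/_search?pretty' -d '{
  "query" : { "match_all" : {} }
}'
```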
On Saturday, December 7, 2013 6:10:50 AM UTC-5, Jörg Prante wrote:
I did find a way to prevent nodes from running rivers.
node.river: false|true
I set this to true on both my data nodes and false on my master node. So
far so good.
I also noticed that multiple rivers ran on different nodes. I'm not
certain whether this was a side effect of that setting or a coincidence.
I'll be doing more testing in the next couple of weeks.
All your ideas have merit. There is definitely room to improve rivers.
Not always running and a reliable status would be huge.
Justin
I could have sworn I saw node.river accept true|false, but it actually
takes none or a comma-separated list.
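So, if I'm reading the node settings right, the master would be configured in elasticsearch.yml something like this (the _none_ spelling is from the docs; the list form apparently restricts which river types a node may run):

```yaml
# Master node: never allocate river instances here
node.river: _none_

# Data nodes: optionally restrict to specific river types, e.g.
# node.river: jdbc
```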
On Monday, December 9, 2013 11:35:25 AM UTC-5, Justin Doles wrote: