JDBC river doesn't start after delete & recreate

Justin_Doles · December 6, 2013, 6:28pm

I'm having an issue with JDBC rivers. I'm running ES 0.90.7 and JDBC river
2.2.3. I have a 3 node cluster: 1 w/no data (Windows) and 2 w/ data
(Linux).

I can create a simple river initially.

{
"type" : "jdbc",
"jdbc" : {
"strategy" : "oneshot",
"driver" : "com.mysql.jdbc.Driver",
"url" : "jdbc:mysql://192.168.1.1:6033/test",
"user" : "test_account",
"password" : "test_password",
"sql" : "SELECT orders.id AS _id, orders.name,
orders.description, BinaryToGuid(orders.guid), orders.number
FROM test.orders;"
},
"index" : {
"index" : "orders",
"type" : "order"
}
}

This works for the initial load. If I stop the river while it's running by
deleting it, I cannot start another river with the same parameters until
stop ES on the data node that was processing this river. Once I do that,
the other node starts the river. Is there something I'm not understanding?
I don't see any errors in the logs.

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7614d944-d89a-463e-8c46-6e07134f230f%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · December 6, 2013, 7:19pm

Not sure if this is related to rivers in general.

The JDBC river runs in a separate thread and writes state info at each
cycle into the private river index. Deleting a river while the river is
running may not remove the river resources completely, or it may hang doing
this. This could explain why the cluster is thinking it should restart the
river instance at the other node. At least, I have not tested these
situations.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGKk6KAy-ZX0xWrc%2BQXnQ_4dSZT0%2BBebo-iOS8vX8CqSw%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

Justin_Doles · December 6, 2013, 8:21pm

You may be right. I'm far from an expert in ES. If I delete the first
river (my_river_1) while it's running (delete is successful) then create a
new river (my_river_2) with the same parameters, the second river won't
start until I stop the ES node that was processing the first river.

I've tried waiting a few minutes between, but it doesn't seem to matter.

Justin

On Friday, December 6, 2013 2:19:09 PM UTC-5, Jörg Prante wrote:

Not sure if this is related to rivers in general.

The JDBC river runs in a separate thread and writes state info at each
cycle into the private river index. Deleting a river while the river is
running may not remove the river resources completely, or it may hang doing
this. This could explain why the cluster is thinking it should restart the
river instance at the other node. At least, I have not tested these
situations.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/10301994-02a9-492e-ab4d-2bf37235e115%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Justin_Doles · December 7, 2013, 2:38am

I did some more digging and you're right about it being a river issue (as
far as I can tell). It looks like the rivers all get assigned to a single
node . If I delete a river midway, it won't run any additional rivers that
are created. But once that node is shutdown, the other rivers I created
after the delete get assigned to another node and begin to process.

Rivers are singletons within the cluster. They get allocated
automatically to one of the nodes and run. If that node fails, a river will
be automatically allocated to another node.

I can't tell if this is a bug or not though. Or maybe there's a time I
need to wait?

Justin

On Friday, December 6, 2013 3:21:51 PM UTC-5, Justin Doles wrote:

You may be right. I'm far from an expert in ES. If I delete the first
river (my_river_1) while it's running (delete is successful) then create a
new river (my_river_2) with the same parameters, the second river won't
start until I stop the ES node that was processing the first river.

I've tried waiting a few minutes between, but it doesn't seem to matter.

Justin

On Friday, December 6, 2013 2:19:09 PM UTC-5, Jörg Prante wrote:

Not sure if this is related to rivers in general.

The JDBC river runs in a separate thread and writes state info at each
cycle into the private river index. Deleting a river while the river is
running may not remove the river resources completely, or it may hang doing
this. This could explain why the cluster is thinking it should restart the
river instance at the other node. At least, I have not tested these
situations.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/41c5fa8e-f5d9-4531-8cbf-761b0f2b9d1e%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · December 7, 2013, 11:10am

I think the concept of river is broken. For example:

it is assumed that river instances shall always run. If a node fails, all
the river instances on that node are started again on other nodes. The idea
is to run river contiuously without interruption so no data gets lost.
the river cluster service does not watch what river instances are
currently doing and what the river instance state is since the river
instance state is private to the river.
if a river instance is deleted, the river cluster service must know this
instance is permanently removed. But what is permanently if you can
recreate a river instance under the same name after a deletion?

There were discussions that rivers may be deprecated in favor of message
queues like logstash.

I think it would be a good idea to improve the river concept to a truly
distributed design.

For example:

rivers should be aware of many river instances in parallel so they could
share the work by dividing the workload
a river instance should always be distributed to many nodes, and by river
instance creation, a plan of execution is announced to all river instances
river instances should (similar to web crawlers) receive a list of URLs
of sources they can process in parallel. The URLs carry schemes for custom
URL handlers (like twitter://, wikipedia://, jdbc:// etc.) Dispatching the
URLs would be a central task at river initiation phase, probably of the ES
master node, or the node that receives a river creation request. The state
of each (active) URL should be available in the cluster state
and, river instances should be identifiable by the cluster service by an
ID, and should respond with a state message if they are asked for a report.
Also, a river instance should be able to receive stop signals and react in
a predictable way (finishing the URL queue, finishing current URL then
abort the URL queue, or abort immediately)
river instances should be able to shutdown automatically if the list of
URLs they received is done and delete themselves from the active river
instance list in the cluster state
plan of execution could also be defined by a cron-like request
nodes should be configurable if they can run river instances or not
the number of river instances could also be a parameter in a river
creation request. So if the number of URLs to be processed exceed the
available river nodes, they would have to be executed in a queue
by providing a standard bulk indexing procedure in a new generic river
framework common to all rivers, writing custom code for rivers would reduce
to the mere task of handling a single URL for fetching data and construct
JSON documents in a stream-like manner, maybe with something like JSON-Path
keys for inserting values.

So many wishes.... sorry for that. But it's christmas time

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHDfOLg6ETjVxBdD-kOq1UfKA9Rt9qzWuJJ33VeS4OW_Q%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

Gabe_Gorelick_Feldma · December 7, 2013, 3:58pm

Given that rivers in general seem to be flawed, is there an easy way
clients can tell whether a JDBC river job is running (so they don't try to
delete it)? Maybe a field in the internal JDBC river document? I haven't
seen any documentation on the structure of that doc.

On Saturday, December 7, 2013 6:10:50 AM UTC-5, Jörg Prante wrote:

I think the concept of river is broken. For example:

it is assumed that river instances shall always run. If a node fails,
all the river instances on that node are started again on other nodes. The
idea is to run river contiuously without interruption so no data gets lost.

the river cluster service does not watch what river instances are
currently doing and what the river instance state is since the river
instance state is private to the river.

if a river instance is deleted, the river cluster service must know this
instance is permanently removed. But what is permanently if you can
recreate a river instance under the same name after a deletion?

There were discussions that rivers may be deprecated in favor of message
queues like logstash.

I think it would be a good idea to improve the river concept to a truly
distributed design.

For example:

rivers should be aware of many river instances in parallel so they could
share the work by dividing the workload

a river instance should always be distributed to many nodes, and by
river instance creation, a plan of execution is announced to all river
instances

river instances should (similar to web crawlers) receive a list of URLs
of sources they can process in parallel. The URLs carry schemes for custom
URL handlers (like twitter://, wikipedia://, jdbc:// etc.) Dispatching the
URLs would be a central task at river initiation phase, probably of the ES
master node, or the node that receives a river creation request. The state
of each (active) URL should be available in the cluster state

and, river instances should be identifiable by the cluster service by an
ID, and should respond with a state message if they are asked for a report.
Also, a river instance should be able to receive stop signals and react in
a predictable way (finishing the URL queue, finishing current URL then
abort the URL queue, or abort immediately)

river instances should be able to shutdown automatically if the list of
URLs they received is done and delete themselves from the active river
instance list in the cluster state

plan of execution could also be defined by a cron-like request

nodes should be configurable if they can run river instances or not

the number of river instances could also be a parameter in a river
creation request. So if the number of URLs to be processed exceed the
available river nodes, they would have to be executed in a queue

by providing a standard bulk indexing procedure in a new generic river
framework common to all rivers, writing custom code for rivers would reduce
to the mere task of handling a single URL for fetching data and construct
JSON documents in a stream-like manner, maybe with something like JSON-Path
keys for inserting values.

So many wishes.... sorry for that. But it's christmas time

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/897645ca-013c-40d7-9f4c-102f2fac9912%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · December 7, 2013, 5:36pm

Finding out if a JDBC river job runs has to be implemented, it is not
present yet.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHqysv5diJw_pKzk_CRz%2BFhFbhdgnNwCFhuEy6ua0VRHg%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

Gabe_Gorelick_Feldma · December 7, 2013, 7:56pm

How hard would it be to implement? I'm happy to help if someone points me
in the right direction.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c8667406-1bee-47c8-82fa-1f02e1a79d10%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · December 7, 2013, 9:45pm

It's very easy, I added an issue. An activity flag can be added to the
river state document.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHZk6sPZF_GN0%2BEpAUrmfTxcZA%2B%2BCz2v-nUrwmjoHt1hg%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

Justin_Doles · December 9, 2013, 4:35pm

I did find a way to prevent nodes from running rivers.
node.river: false|true

I set this to true on both my data nodes and false on my master node. So
far so good.

I also noticed that multiple rivers ran on different nodes. I'm not
certain if this was a side effect of that setting or a coincidence. I'll
be doing more testing in the next couple weeks.

All your ideas have merit. There is definitely room to improve rivers.
Not always running and a reliable status would be huge.

Justin

On Saturday, December 7, 2013 6:10:50 AM UTC-5, Jörg Prante wrote:

I think the concept of river is broken. For example:

it is assumed that river instances shall always run. If a node fails,
all the river instances on that node are started again on other nodes. The
idea is to run river contiuously without interruption so no data gets lost.

the river cluster service does not watch what river instances are
currently doing and what the river instance state is since the river
instance state is private to the river.

if a river instance is deleted, the river cluster service must know this
instance is permanently removed. But what is permanently if you can
recreate a river instance under the same name after a deletion?

There were discussions that rivers may be deprecated in favor of message
queues like logstash.

I think it would be a good idea to improve the river concept to a truly
distributed design.

For example:

rivers should be aware of many river instances in parallel so they could
share the work by dividing the workload

a river instance should always be distributed to many nodes, and by
river instance creation, a plan of execution is announced to all river
instances

river instances should (similar to web crawlers) receive a list of URLs
of sources they can process in parallel. The URLs carry schemes for custom
URL handlers (like twitter://, wikipedia://, jdbc:// etc.) Dispatching the
URLs would be a central task at river initiation phase, probably of the ES
master node, or the node that receives a river creation request. The state
of each (active) URL should be available in the cluster state

and, river instances should be identifiable by the cluster service by an
ID, and should respond with a state message if they are asked for a report.
Also, a river instance should be able to receive stop signals and react in
a predictable way (finishing the URL queue, finishing current URL then
abort the URL queue, or abort immediately)

river instances should be able to shutdown automatically if the list of
URLs they received is done and delete themselves from the active river
instance list in the cluster state

plan of execution could also be defined by a cron-like request

nodes should be configurable if they can run river instances or not

the number of river instances could also be a parameter in a river
creation request. So if the number of URLs to be processed exceed the
available river nodes, they would have to be executed in a queue

by providing a standard bulk indexing procedure in a new generic river
framework common to all rivers, writing custom code for rivers would reduce
to the mere task of handling a single URL for fetching data and construct
JSON documents in a stream-like manner, maybe with something like JSON-Path
keys for inserting values.

So many wishes.... sorry for that. But it's christmas time

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d57f4561-9ee0-4674-9a8c-56cf4afb21ba%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Justin_Doles · December 9, 2013, 4:41pm

I swore I saw node.river accepted true|false, but it's none or a comma
separated list.

On Monday, December 9, 2013 11:35:25 AM UTC-5, Justin Doles wrote:

I did find a way to prevent nodes from running rivers.
node.river: false|true

I set this to true on both my data nodes and false on my master node. So
far so good.

I also noticed that multiple rivers ran on different nodes. I'm not
certain if this was a side effect of that setting or a coincidence. I'll
be doing more testing in the next couple weeks.

All your ideas have merit. There is definitely room to improve rivers.
Not always running and a reliable status would be huge.

Justin

On Saturday, December 7, 2013 6:10:50 AM UTC-5, Jörg Prante wrote:

I think the concept of river is broken. For example:

it is assumed that river instances shall always run. If a node fails,
all the river instances on that node are started again on other nodes. The
idea is to run river contiuously without interruption so no data gets lost.

the river cluster service does not watch what river instances are
currently doing and what the river instance state is since the river
instance state is private to the river.

if a river instance is deleted, the river cluster service must know
this instance is permanently removed. But what is permanently if you can
recreate a river instance under the same name after a deletion?

There were discussions that rivers may be deprecated in favor of message
queues like logstash.

I think it would be a good idea to improve the river concept to a truly
distributed design.

For example:

rivers should be aware of many river instances in parallel so they
could share the work by dividing the workload

a river instance should always be distributed to many nodes, and by
river instance creation, a plan of execution is announced to all river
instances

river instances should (similar to web crawlers) receive a list of URLs
of sources they can process in parallel. The URLs carry schemes for custom
URL handlers (like twitter://, wikipedia://, jdbc:// etc.) Dispatching the
URLs would be a central task at river initiation phase, probably of the ES
master node, or the node that receives a river creation request. The state
of each (active) URL should be available in the cluster state

and, river instances should be identifiable by the cluster service by
an ID, and should respond with a state message if they are asked for a
report. Also, a river instance should be able to receive stop signals and
react in a predictable way (finishing the URL queue, finishing current URL
then abort the URL queue, or abort immediately)

river instances should be able to shutdown automatically if the list of
URLs they received is done and delete themselves from the active river
instance list in the cluster state

plan of execution could also be defined by a cron-like request

nodes should be configurable if they can run river instances or not

the number of river instances could also be a parameter in a river
creation request. So if the number of URLs to be processed exceed the
available river nodes, they would have to be executed in a queue

by providing a standard bulk indexing procedure in a new generic river
framework common to all rivers, writing custom code for rivers would reduce
to the mere task of handling a single URL for fetching data and construct
JSON documents in a stream-like manner, maybe with something like JSON-Path
keys for inserting values.

So many wishes.... sorry for that. But it's christmas time

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/509589f4-686a-4034-a122-4eb8b1fb75c9%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · December 9, 2013, 5:05pm

Yes, it's "none". False/true is not recognized. This peculiarity could be
easily fixed, the code is in
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/river/cluster/RiverNodeHelper.java#L40

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoEBdNsxfs2tWtzHTzC-BdsjKAch2SRKiiw9MGTphBP0wg%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

Topic		Replies	Views
Jdbc river won't start Elasticsearch	3	343	July 6, 2017
River not being migrated to new node on service restart (ES 0.15.0) Elasticsearch	4	378	July 6, 2017
Manually starting rivers? Elasticsearch	5	337	July 6, 2017
Can't start newly created river on several active nodes Elasticsearch	8	361	July 6, 2017
River does not start correctly after full cluster shutdown Elasticsearch	5	371	July 6, 2017

JDBC river doesn't start after delete & recreate

Related topics