JDBC river doesn't start after delete & recreate


(Justin Doles) #1

I'm having an issue with JDBC rivers. I'm running ES 0.90.7 and JDBC river
2.2.3. I have a 3 node cluster: 1 w/no data (Windows) and 2 w/ data
(Linux).

I can create a simple river initially.

{
    "type" : "jdbc",
    "jdbc" : {
        "strategy" : "oneshot",
        "driver" : "com.mysql.jdbc.Driver",
        "url" : "jdbc:mysql://192.168.1.1:6033/test",
        "user" : "test_account",
        "password" : "test_password",
        "sql" : "SELECT orders.id AS _id, orders.name, orders.description, BinaryToGuid(orders.guid), orders.number FROM test.orders;"
    },
    "index" : {
        "index" : "orders",
        "type" : "order"
    }
}

This works for the initial load. If I stop the river while it's running by
deleting it, I cannot start another river with the same parameters until I
stop ES on the data node that was processing this river. Once I do that,
the other node starts the river. Is there something I'm not understanding?
I don't see any errors in the logs.

Thanks.
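
For readers who want to reproduce the sequence described above (create a river, delete it while it runs, create it again), here is a minimal Python sketch. It only builds the HTTP requests and does not send them. The host and port, the river names, and the shortened SQL are assumptions for illustration; the `_river/<name>/_meta` endpoint is the usual way rivers were registered in ES 0.90.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a node reachable on the default HTTP port


def create_river_request(name, definition):
    """Build (but do not send) the PUT that registers a river under _river/<name>/_meta."""
    body = json.dumps(definition).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/_river/{name}/_meta",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def delete_river_request(name):
    """Build (but do not send) the DELETE that removes the river's entry in the _river index."""
    return urllib.request.Request(f"{ES_URL}/_river/{name}", method="DELETE")


# Same shape as the definition in the post; the SQL is abbreviated here.
definition = {
    "type": "jdbc",
    "jdbc": {
        "strategy": "oneshot",
        "driver": "com.mysql.jdbc.Driver",
        "url": "jdbc:mysql://192.168.1.1:6033/test",
        "user": "test_account",
        "password": "test_password",
        "sql": "SELECT orders.id AS _id, orders.name FROM test.orders",
    },
    "index": {"index": "orders", "type": "order"},
}

# Create, delete while running, then create again with the same parameters.
steps = [
    create_river_request("my_river_1", definition),
    delete_river_request("my_river_1"),
    create_river_request("my_river_2", definition),
]

for req in steps:
    print(req.get_method(), req.full_url)
```

To actually send a request, pass it to `urllib.request.urlopen(...)` against a test cluster.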

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7614d944-d89a-463e-8c46-6e07134f230f%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

Not sure if this is related to rivers in general.

The JDBC river runs in a separate thread and writes state info at each
cycle into the private river index. Deleting a river while the river is
running may not remove the river resources completely, or it may hang doing
this. This could explain why the cluster is thinking it should restart the
river instance at the other node. At least, I have not tested these
situations.

Jörg



(Justin Doles) #3

You may be right. I'm far from an expert in ES. If I delete the first
river (my_river_1) while it's running (delete is successful) then create a
new river (my_river_2) with the same parameters, the second river won't
start until I stop the ES node that was processing the first river.

I've tried waiting a few minutes between, but it doesn't seem to matter.

Justin

On Friday, December 6, 2013 2:19:09 PM UTC-5, Jörg Prante wrote:



(Justin Doles) #4

I did some more digging and you're right about it being a river issue (as
far as I can tell). It looks like the rivers all get assigned to a single
node. If I delete a river midway, that node won't run any additional rivers
that are created. But once that node is shut down, the rivers I created
after the delete get assigned to another node and begin to process.

http://www.elasticsearch.org/guide/en/elasticsearch/rivers/current/index.html#allocation

Rivers are singletons within the cluster. They get allocated
automatically to one of the nodes and run. If that node fails, a river will
be automatically allocated to another node.

I can't tell if this is a bug or not though. Or maybe there's a time I
need to wait?

Justin

On Friday, December 6, 2013 3:21:51 PM UTC-5, Justin Doles wrote:



(Jörg Prante) #5

I think the concept of river is broken. For example:

  • it is assumed that river instances shall always run. If a node fails, all
    the river instances on that node are started again on other nodes. The idea
    is to run rivers continuously without interruption so no data gets lost.

  • the river cluster service does not watch what river instances are
    currently doing and what the river instance state is since the river
    instance state is private to the river.

  • if a river instance is deleted, the river cluster service must know this
    instance is permanently removed. But what does "permanently" mean if you
    can recreate a river instance under the same name after a deletion?

There were discussions that rivers may be deprecated in favor of message
queues like logstash.

I think it would be a good idea to improve the river concept to a truly
distributed design.

For example:

  • rivers should be aware of many river instances in parallel so they could
    share the work by dividing the workload

  • a river instance should always be distributed to many nodes, and by river
    instance creation, a plan of execution is announced to all river instances

  • river instances should (similar to web crawlers) receive a list of URLs
    of sources they can process in parallel. The URLs carry schemes for custom
    URL handlers (like twitter://, wikipedia://, jdbc:// etc.) Dispatching the
    URLs would be a central task at river initiation phase, probably of the ES
    master node, or the node that receives a river creation request. The state
    of each (active) URL should be available in the cluster state

  • and, river instances should be identifiable by the cluster service by an
    ID, and should respond with a state message if they are asked for a report.
    Also, a river instance should be able to receive stop signals and react in
    a predictable way (finishing the URL queue, finishing current URL then
    abort the URL queue, or abort immediately)

  • river instances should be able to shut down automatically once the list of
    URLs they received is done and delete themselves from the active river
    instance list in the cluster state

  • plan of execution could also be defined by a cron-like request

  • it should be configurable whether a node can run river instances or not

  • the number of river instances could also be a parameter in a river
    creation request. So if the number of URLs to be processed exceeds the
    number of available river nodes, they would have to be executed in a queue

  • by providing a standard bulk indexing procedure in a new generic river
    framework common to all rivers, writing custom code for rivers would reduce
    to the mere task of handling a single URL for fetching data and constructing
    JSON documents in a stream-like manner, maybe with something like JSON-Path
    keys for inserting values.
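
To make the wish list above more concrete, here is a rough Python sketch of one river instance that works through a URL queue, reports its state on request, and reacts to the three stop behaviours mentioned (finish the queue, finish the current URL, abort immediately). Everything in it (`RiverInstance`, `StopMode`, the state fields, the example URLs) is hypothetical and exists in no Elasticsearch release; it is single-threaded, so the stop signal is simulated from inside the handler.

```python
import enum
from collections import deque


class StopMode(enum.Enum):
    FINISH_QUEUE = "finish_queue"      # keep going until the queue is empty
    FINISH_CURRENT = "finish_current"  # finish the URL in progress, drop the rest
    ABORT = "abort"                    # stop right away, even mid-URL


class RiverInstance:
    """Hypothetical river instance: one ID, one queue of source URLs."""

    def __init__(self, instance_id, urls):
        self.instance_id = instance_id
        self.queue = deque(urls)
        self.done = []
        self.current = None
        self.stop_mode = None
        self.running = False

    def stop(self, mode):
        """Record a stop request; run() honours it at the next check."""
        self.stop_mode = mode

    def state(self):
        """State report a cluster service could ask for."""
        return {
            "id": self.instance_id,
            "running": self.running,
            "current": self.current,
            "pending": len(self.queue),
            "done": list(self.done),
        }

    def run(self, handler):
        """handler(url) yields documents; stop requests are checked between documents."""
        self.running = True
        while self.queue:
            if self.stop_mode in (StopMode.FINISH_CURRENT, StopMode.ABORT):
                self.queue.clear()
                break
            self.current = self.queue.popleft()
            aborted = False
            for _doc in handler(self.current):
                # a real river would hand _doc to bulk indexing here
                if self.stop_mode is StopMode.ABORT:
                    aborted = True
                    break
            if aborted:
                self.queue.clear()
                break
            self.done.append(self.current)
        # queue exhausted or stop honoured: shut down on our own
        self.current = None
        self.running = False


river = RiverInstance("river-1", ["jdbc://a", "jdbc://b", "jdbc://c"])


def handler(url):
    for row in range(3):
        if url == "jdbc://b" and row == 0:
            river.stop(StopMode.FINISH_CURRENT)  # simulate a stop signal mid-URL
        yield {"url": url, "row": row}


river.run(handler)
print(river.state())
```

In this sketch `FINISH_QUEUE` needs no special handling, because the instance already shuts itself down once the queue is empty.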

So many wishes... sorry for that. But it's Christmas time ;-)

Jörg



(Gabe Gorelick-Feldman) #6

Given that rivers in general seem to be flawed, is there an easy way
clients can tell whether a JDBC river job is running (so they don't try to
delete it)? Maybe a field in the internal JDBC river document? I haven't
seen any documentation on the structure of that doc.

On Saturday, December 7, 2013 6:10:50 AM UTC-5, Jörg Prante wrote:



(Jörg Prante) #7

Finding out whether a JDBC river job is running still has to be
implemented; it is not present yet.

Jörg



(Gabe Gorelick-Feldman) #8

How hard would it be to implement? I'm happy to help if someone points me
in the right direction.



(Jörg Prante) #9

It's very easy, I added an issue. An activity flag can be added to the
river state document.
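
As a rough illustration of how a client could use such a flag once it exists, here is a small Python sketch. The field name `active` and the shape of the state document are assumptions; nothing in this thread specifies them, and the flag was not implemented at the time of writing.

```python
def safe_to_delete(state_doc, flag="active"):
    """Return True only if the state document explicitly says the river is idle.

    `state_doc` is the parsed JSON of a GET on the river state document.
    A missing flag is treated as "unknown", so the answer is False.
    """
    source = state_doc.get("_source", {})
    if flag not in source:
        return False
    return source[flag] is False


# Hypothetical state documents, for illustration only.
running = {"_source": {"active": True}}
idle = {"_source": {"active": False}}
legacy = {"_source": {}}

for doc in (running, idle, legacy):
    print(doc, "->", safe_to_delete(doc))
```

Treating a missing flag as "not safe" is a deliberate choice here: with an older river version that never writes the flag, the client would rather skip the delete than hit the problem described at the start of this thread.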

Jörg



(Justin Doles) #10

I did find a way to prevent nodes from running rivers.
node.river: false|true

I set this to true on both my data nodes and false on my master node. So
far so good.

I also noticed that multiple rivers ran on different nodes. I'm not
certain if this was a side effect of that setting or a coincidence. I'll
be doing more testing in the next couple weeks.

All your ideas have merit. There is definitely room to improve rivers.
Not always running and a reliable status would be huge.

Justin



(Justin Doles) #11

I could have sworn node.river accepted true|false, but it's _none_ or a
comma-separated list.
http://www.elasticsearch.org/guide/en/elasticsearch/rivers/current/index.html#allocation

On Monday, December 9, 2013 11:35:25 AM UTC-5, Justin Doles wrote:



(Jörg Prante) #12

Yes, it's "_none_". False/true is not recognized. This peculiarity could
easily be fixed; the code is in
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/river/cluster/RiverNodeHelper.java#L40
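
The allocation rule discussed here can be sketched in a few lines of Python. This is a reading of the documented behaviour, not a port of RiverNodeHelper; the function name is made up, and it assumes the special value is the literal string `_none_` and that list entries may be river names or river types.

```python
def node_allows_river(node_river, river_name, river_type):
    """Decide whether a node may run a river, given its node.river setting.

    node_river is the raw setting value, or None if the setting is absent.
    """
    if node_river is None:
        return True  # setting absent: any river may be allocated here
    if node_river.strip() == "_none_":
        return False  # node opted out of rivers entirely
    allowed = {item.strip() for item in node_river.split(",") if item.strip()}
    return river_name in allowed or river_type in allowed


print(node_allows_river(None, "my_river_1", "jdbc"))
print(node_allows_river("_none_", "my_river_1", "jdbc"))
print(node_allows_river("jdbc,twitter", "my_river_1", "jdbc"))
print(node_allows_river("true", "my_river_1", "jdbc"))
```

Under this reading, `node.river: true` is simply a one-item list naming a river or type called "true", so it would match nothing, which fits the observation that false/true is not recognized.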

Jörg


