River -- Possibility to notify the database once processed?


(boeledi) #1

Hi,

Is there any way for a River to notify the database as soon as the
fetched records have been processed? This would let us know that the
database and ES are synchronised...

Many thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Adrien Grand) #2

Hi Didier,

On Mon, Sep 23, 2013 at 5:00 PM, boeledi didier.boelens@gmail.com wrote:

Is there any way for a River to notify the database as soon as the
fetched records have been processed? This would let us know that the
database and ES are synchronised...

If you are working on synchronizing the content of a database with
Elasticsearch, I would recommend not using rivers at all but just writing a
simple script that would fetch rows from the database and push them to
Elasticsearch. This often ends up being simpler, and in your case it would
make it possible to let the database know that the import is finished
without relying on the existence of a specific River API.
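
A minimal sketch of such a script, for illustration only. It assumes a hypothetical `articles` table with a `synced` flag column, uses SQLite as a stand-in for the real database, and takes a caller-supplied `send_bulk` function in place of a real HTTP call to the Elasticsearch `_bulk` endpoint. None of these names come from this thread.

```python
import json
import sqlite3


def build_bulk_body(rows, index):
    """Build a newline-delimited JSON body in the Elasticsearch bulk format."""
    lines = []
    for row_id, title in rows:
        # Action line first, then the document source on the next line.
        # Assumption: older Elasticsearch versions also expect a "_type" here.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(row_id)}}))
        lines.append(json.dumps({"title": title}))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def sync_pending(conn, send_bulk, index="articles"):
    """Push unsynced rows, then record in the database that they were indexed."""
    rows = conn.execute(
        "SELECT id, title FROM articles WHERE synced = 0 ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    # send_bulk is a hypothetical callable; it should raise if the push fails,
    # so that the rows below are only marked after a successful import.
    send_bulk(build_bulk_body(rows, index))
    conn.executemany(
        "UPDATE articles SET synced = 1 WHERE id = ?", [(row[0],) for row in rows]
    )
    conn.commit()
    return len(rows)
```

Because the database is updated only after `send_bulk` returns, the `synced` column is the notification asked about in the original question: any row still at 0 has not reached Elasticsearch yet.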

--
Adrien Grand



(Brian 'Phunk' Gadoury) #3

I'm not sure why Adrien recommends against using a River. "Synchronizing
the content of a database with Elasticsearch" is exactly what rivers do.
They are also balanced and recoverable just like ES shards.

To see if a river has processed all the changes in a database, I have a
script that does this (for our CouchDB river):

  • curl search:9200/_river/my_river/_seq?pretty and parse the last_seq value
    out of that JSON

  • curl couchdb:5984/my_db and parse the update_seq value out of that JSON

If they match, your river is up to date. If your river's last_seq is lower
than your database's update_seq, then your river is not up to date yet.

You can also query that river doc on a loop to determine if your river is
doing anything or if it's idle.
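
The check above could be sketched roughly like this in Python. The two URLs are the ones from the curl commands; the nesting of `last_seq` under `_source.couchdb` is an assumption about the river's `_seq` document, so adjust it to whatever your river actually stores. The function names are made up for this example.

```python
import json
import time
from urllib.request import urlopen


def fetch_json(url):
    """GET a URL and parse the response body as JSON."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def river_is_up_to_date(river_doc, db_info):
    """Compare the river's last_seq with the database's update_seq."""
    # Assumption: the _seq document keeps last_seq under _source.couchdb.
    last_seq = river_doc["_source"]["couchdb"]["last_seq"]
    # The two values may differ in type (string vs integer), so compare as text.
    return str(last_seq) == str(db_info["update_seq"])


def wait_until_synced(get_river_doc, get_db_info, attempts=30, delay=1.0):
    """Poll both sides until they match or the attempts run out."""
    for _ in range(attempts):
        if river_is_up_to_date(get_river_doc(), get_db_info()):
            return True
        time.sleep(delay)
    return False
```

Usage would be something like `wait_until_synced(lambda: fetch_json("http://search:9200/_river/my_river/_seq"), lambda: fetch_json("http://couchdb:5984/my_db"))`. Only equality is tested here, since a "lower than" comparison is meaningful only when both sequence values are plain integers.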

-Brian




(David Pilato) #4

Because a river is essentially an external process that runs inside a node to do something other than indexing and searching.
Imagine that you want to run OCR on PDF documents. You know that this is really intensive in terms of CPU usage, right?

Does it make sense to have that heavy process running in an elasticsearch node?

It could be better to have that process outside elasticsearch itself.

Rivers are nice when you are discovering Elasticsearch. In my personal experience, you often move from rivers to another process (batch, ETL, Logstash…) to get finer control over that process.
Also, a river is a singleton, so it does not scale.

I think that's what Adrien explained.

My 2 cents.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs




(Brian 'Phunk' Gadoury) #5

Hi David,

I understand what you're saying, but Didier didn't write anything about
running CPU-intensive work like OCR on PDF documents, so I don't see how
that applies to this conversation.

It seemed like Didier described the standard use case for a river. (Based
on the rather scant details, I'll admit.)

-Brian




(Jörg Prante) #6

If you study the JDBC river, there is an optional acknowledgement mechanism
(which unfortunately does not work properly). Bulk requests are acknowledged
over an extra write connection back to the DB, into a special table that has
to be prepared beforehand.

But it is not meant for synchronization. Syncing data is very hard because
you have to follow the same data model, including modifications, additions,
deletions. In fact, creating JSON out of table rows and vice versa is not a
straightforward task; there are many variants. There is no natural
equivalence between a DB table and Elasticsearch documents. It is true that
rivers do not scale well - if you wanted millions of change events
propagated to ES in a few seconds, this would cause heavy congestion.
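
To illustrate the "many variants" point about turning rows into JSON, here is a small sketch with made-up data. The same joined result set (hypothetical orders and order items) can become either one flat document per row or one nested document per order, and neither is the obviously right one.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical result of joining an orders table with its items.
ROWS = [
    {"order_id": 1, "customer": "alice", "item": "pen", "qty": 2},
    {"order_id": 1, "customer": "alice", "item": "ink", "qty": 1},
    {"order_id": 2, "customer": "bob", "item": "pad", "qty": 5},
]


def one_doc_per_row(rows):
    """Variant 1: every joined row becomes its own flat document."""
    return [dict(row) for row in rows]


def one_doc_per_order(rows):
    """Variant 2: rows sharing an order_id collapse into one nested document."""
    docs = []
    ordered = sorted(rows, key=itemgetter("order_id"))
    for order_id, group in groupby(ordered, key=itemgetter("order_id")):
        group = list(group)
        docs.append({
            "order_id": order_id,
            "customer": group[0]["customer"],
            "items": [{"item": r["item"], "qty": r["qty"]} for r in group],
        })
    return docs
```

With the sample rows, the first variant yields three documents and the second yields two. Going back from the nested form to table rows needs its own set of decisions, which is part of why syncing in both directions is hard.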

As said by Adrien and David, it is easier to detect changes and push data
from the DB to a target like ES, preferably with the help of triggers and
bulk ingest, and after success, do some accounting about the operation with
SQL at the DB side.
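
A rough sketch of that approach, under several assumptions: SQLite stands in for the real database, the `products` and `changes` tables and the trigger names are invented for the example, and `send_bulk` is a hypothetical callable that posts the body to the Elasticsearch `_bulk` endpoint and raises on failure.

```python
import json
import sqlite3

# Triggers record every insert, update and delete in a change-log table.
SCHEMA = """
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE changes (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER NOT NULL,
    op TEXT NOT NULL,
    pushed INTEGER NOT NULL DEFAULT 0
);
CREATE TRIGGER products_ins AFTER INSERT ON products BEGIN
    INSERT INTO changes (product_id, op) VALUES (NEW.id, 'index');
END;
CREATE TRIGGER products_upd AFTER UPDATE ON products BEGIN
    INSERT INTO changes (product_id, op) VALUES (NEW.id, 'index');
END;
CREATE TRIGGER products_del AFTER DELETE ON products BEGIN
    INSERT INTO changes (product_id, op) VALUES (OLD.id, 'delete');
END;
"""


def push_changes(conn, send_bulk, index="products"):
    """Bulk-push pending changes, then account for them on the DB side."""
    pending = conn.execute(
        "SELECT c.seq, c.product_id, c.op, p.name FROM changes c "
        "LEFT JOIN products p ON p.id = c.product_id "
        "WHERE c.pushed = 0 ORDER BY c.seq"
    ).fetchall()
    if not pending:
        return 0
    lines = []
    for seq, product_id, op, name in pending:
        meta = {"_index": index, "_id": str(product_id)}
        if op == "delete" or name is None:
            # The row is gone (or was deleted later), so only a delete makes sense.
            lines.append(json.dumps({"delete": meta}))
        else:
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps({"name": name}))
    send_bulk("\n".join(lines) + "\n")
    # The accounting step: mark exactly the changes that were just sent.
    conn.executemany(
        "UPDATE changes SET pushed = 1 WHERE seq = ?", [(row[0],) for row in pending]
    )
    conn.commit()
    return len(pending)
```

After a successful run, `SELECT COUNT(*) FROM changes WHERE pushed = 0` tells you on the database side whether anything is still waiting to be sent to ES.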

Jörg


