Rivers reimport data at each Elasticsearch restart

Hello,

I have a question about rivers: when a river is used to import data into
Elasticsearch, it imports the data again at every Elasticsearch restart.

In our project, we do the following:

  • Raw data is imported into Elasticsearch from a MySQL database using
    the JDBC river (https://github.com/jprante/elasticsearch-river-jdbc);
  • Some updates are executed directly on the newly imported data in
    Elasticsearch using POST requests;
  • In the end, the final data stored in Elasticsearch is not the same
    as the imported raw data.

The problem we are facing is that when Elasticsearch is restarted, the JDBC
river reimports the raw data and thereby overwrites the transformations we
made. We suppose that this is intentional behavior of Elasticsearch rivers.
One way to avoid the reimport is to delete the corresponding _river index,
which is supposed to store the state of the rivers.

Our questions are as follows:

  • Is reimporting data from rivers at each restart a standard use case?
    Is it useful for some applications?
  • What is the point of saving state in the _river index?
    • Is there a way to avoid reimporting the data without having to
      delete the corresponding _river index?
    • Are there any downsides (for our use case) to deleting the
      corresponding _river index?

Thanks,
Stéphane.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a59ade79-e474-466b-bf54-1476a7c506bb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

It is up to the river implementation how the data import is handled.

The JDBC river, in the "simple" strategy, imports data whenever the river is
started, regardless of what already exists in the cluster or index. It is
possible to implement other strategies, for example, a strategy that
performs a check before indexing.

Elasticsearch gives river implementations no support for controlling how
they should behave on node start/stop. The JDBC river tries to compensate
for this by persisting a JDBC-river-specific state. This state is useful
for flow control.

If you no longer need the river, you can delete the river with curl
-XDELETE. This shuts down the river instance threads gracefully and
releases resources.

If you delete the _river index with curl -XDELETE, you wipe all data that
is used by rivers. Active river instances are not stopped and are not aware
of what happened, so this is an unfriendly way to terminate river runs, and
all kinds of river errors may occur.
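
To make the difference between the two deletes concrete, here is a minimal Python sketch that only builds the two HTTP requests. The node address http://localhost:9200 and the river name "documents" are assumptions for illustration; nothing is sent until the request is passed to urllib.request.urlopen.

```python
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumed node address, adjust as needed


def delete_river_request(river_name: str, base_url: str = ES_URL) -> Request:
    """Build DELETE /_river/<name>: removes a single river definition."""
    return Request(f"{base_url}/_river/{river_name}", method="DELETE")


def delete_river_index_request(base_url: str = ES_URL) -> Request:
    """Build DELETE /_river: wipes the data of every river (the unfriendly way)."""
    return Request(f"{base_url}/_river", method="DELETE")


if __name__ == "__main__":
    # "documents" is a hypothetical river name.
    req = delete_river_request("documents")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

The first request targets one river only; the second targets the whole _river index and therefore affects all rivers at once.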

Jörg

Hello,
This post interests me.
Is there a way to know when indexing has finished, so that the
XDELETE _river can be triggered at that point?

Because each river can freely implement the data fetch, ES does not offer
river monitoring.

For JDBC river, I implemented some primitive river state query commands
that allow polling for river state changes.
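
As a rough sketch of what such polling could look like on the client side (not the JDBC river's actual API): the helper below repeatedly calls a caller-supplied fetch_state function until a caller-supplied is_finished predicate accepts the result. Both callables, and the "active" field used in the usage example, are hypothetical; the real state document may look different.

```python
import time
from typing import Any, Callable, Optional


def wait_for_river(fetch_state: Callable[[], Any],
                   is_finished: Callable[[Any], bool],
                   interval: float = 5.0,
                   max_polls: int = 120,
                   sleep: Callable[[float], None] = time.sleep) -> Optional[Any]:
    """Poll fetch_state() until is_finished(state) is true.

    Returns the final state, or None if max_polls is exhausted first.
    """
    for attempt in range(max_polls):
        state = fetch_state()
        if is_finished(state):
            return state
        if attempt < max_polls - 1:
            sleep(interval)  # wait before the next poll
    return None


if __name__ == "__main__":
    # Simulated state documents; "active" is a hypothetical field name.
    states = iter([{"active": True}, {"active": True}, {"active": False}])
    final = wait_for_river(lambda: next(states),
                           lambda s: not s["active"],
                           sleep=lambda _: None)
    print(final)
```

Once wait_for_river returns a state, the river could be deleted; if it returns None, the import did not finish within the polling budget.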

Jörg

Thanks for your quick reply,

I need some clarifications about what you meant by "delete the river",
"delete the _river index" and by "this state is useful for flow control".

From what I have understood from your reply, and supposing that I have
imported data into a "documents" river using the JDBC river:

  • "Delete the river" means "DELETE _river/documents" (and does not mean
    "DELETE documents"):
    • This does not affect the already imported data.
    • The data is not reimported into Elasticsearch at restart.
    • Everything is fine for our use case.
  • "Delete the _river index" means "DELETE _river":
    • This does not affect the already imported data.
    • The data is not reimported into Elasticsearch at restart.
    • This should not be done because it affects all the rivers at the
      same time (for the documents river, it is equivalent to doing "DELETE
      _river/documents").
  • "This state is useful for flow control" means that:
    • The state keeps track of what data is already imported so that the
      same raw data (left untouched in Elasticsearch) is not reimported
      multiple times?
    • OR the state keeps a trace of the SQL query so that, in case of an
      error during a node start/stop, the river can be automatically replayed?

Thanks again,
Stéphane.

Yes, removing a river is DELETE _river/rivername and deleting the river
index is DELETE _river.

The JDBC river state keeps track of some timestamps, counters, and the last
row of the SQL statement. Yes, in case of a node switchover, where the river
instance is restarted on another node, the new node could pick up the
currently known state. But all SQL statements will start again, so the SQL
statements must carry the full logic that decides where to restart from.
There is no magic in the JDBC river state that prevents missed data
fetches during river instance switchovers.
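
One way to read "the SQL statements must carry the full logic" is a watermark kept in the database itself. The sketch below is only an illustration under assumptions: SQLite stands in for MySQL so that it runs anywhere, and the tables documents and import_watermark, with their columns, are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL source database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT, updated_at INTEGER);
    CREATE TABLE import_watermark (last_imported_at INTEGER);
    INSERT INTO import_watermark VALUES (0);
    INSERT INTO documents VALUES (1, 'a', 100), (2, 'b', 200);
""")

# The statement itself decides where to restart: only rows newer than the
# watermark are selected, no matter how often the statement is re-run.
FETCH_NEW = """
    SELECT id, body, updated_at FROM documents
    WHERE updated_at > (SELECT last_imported_at FROM import_watermark)
    ORDER BY updated_at
"""


def fetch_and_advance(conn: sqlite3.Connection) -> list:
    """Return rows not yet imported and move the watermark past them."""
    rows = conn.execute(FETCH_NEW).fetchall()
    if rows:
        conn.execute("UPDATE import_watermark SET last_imported_at = ?",
                     (rows[-1][2],))
        conn.commit()
    return rows
```

With this shape, re-running the statement after a restart returns nothing until new rows appear, so already imported (and since modified) documents are not fetched again.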

The JDBC river state cannot be used to find out what data has been imported.
For this, the JDBC river code has to be extended. For example, a customized
"strategy" could be implemented for the JDBC plugin that sends a query
to ES before the SQL statements are executed, with application-specific code
that finds out what data to expect in ES, so it knows when to stop
importing.
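
The check-before-index idea can be sketched independently of the plugin. A real custom strategy would be written in Java inside the JDBC plugin; the Python below only illustrates the filtering logic. The exists_in_es callable is a hypothetical stand-in for whatever query the application sends to ES.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple[str, Dict[str, object]]  # (document id, document source)


def rows_to_index(rows: Iterable[Row],
                  exists_in_es: Callable[[str], bool]) -> List[Row]:
    """Keep only the rows whose document id is not yet present in ES."""
    return [(doc_id, source) for doc_id, source in rows
            if not exists_in_es(doc_id)]


if __name__ == "__main__":
    already_indexed = {"1"}  # pretend document 1 was imported and then updated
    rows = [("1", {"title": "raw"}), ("2", {"title": "raw"})]
    print(rows_to_index(rows, already_indexed.__contains__))
```

Documents that already exist are skipped, so updates applied to them after the first import are left untouched.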

Jörg
