River Plugin for postgresql db - deleting documents in ES that are deleted in postgresql db


(Isabella) #1

Hello everyone!

I am creating an river plugin that synchronizes ES and a postgresql
database. I work with an timestamp to check if my document in the database
has changed, if it has changed it is re-indexed in ES if it has not
changed, my plugin skips the document. My timestamp is a column in my
database and it is also indexed with my document. My problem is, that I
don't know how I can synchronize documents that are deleted in the
database? So my id of my document is no longer available in my database,
but how can I tell ES that it has to delete this document, if I do no
longer have the id to delete it. Does anyone has an idea how I could solve
my problem?

I connect to my database, select all documents out of my database and
iterate through my result set. Then I search for each id in ES and check
timestamp from database with timestamp of indexed document in ES. But if my
document is no longer available in my database, I don't know how to inform
ES that a document has been deleted in my database.

I have looked at the jdbc river plugin, and there is used a version for
each move of the river and if the document is deleted in the database, the
version is lower than in ES, so all documents are deleted. But in my case I
only update documents if they have changed, so only those documents that
are updated are getting a higher version, and those documents that are not
changed in the database, have the same version as before. So if I would
also delete documents like it is done in the jdbc river plugin, I also
would delete documents that are still available in my database, but not
updated because they have not changed in the database. I don't think that
this solution is appropriate for my problem.

Does anyone knows if other river plugins are available, that could help me
with my problem? Or does anyone know a solution, to synchronize deleted
documents in ES?

Thanks for any help.

Hopefully I described my problem in a understandable way.

Best regard,

Isabella

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ce722fe4-6d17-4b23-9d4c-e4f5f49ed349%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

One way for doing that is to maintain a table of deletions in your database with two columns:
id
deletedate

And fetch that table.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 5 déc. 2013 à 09:22, Isabella isi.huber90@gmail.com a écrit :

Hello everyone!

I am creating an river plugin that synchronizes ES and a postgresql database. I work with an timestamp to check if my document in the database has changed, if it has changed it is re-indexed in ES if it has not changed, my plugin skips the document. My timestamp is a column in my database and it is also indexed with my document. My problem is, that I don't know how I can synchronize documents that are deleted in the database? So my id of my document is no longer available in my database, but how can I tell ES that it has to delete this document, if I do no longer have the id to delete it. Does anyone has an idea how I could solve my problem?

I connect to my database, select all documents out of my database and iterate through my result set. Then I search for each id in ES and check timestamp from database with timestamp of indexed document in ES. But if my document is no longer available in my database, I don't know how to inform ES that a document has been deleted in my database.

I have looked at the jdbc river plugin, and there is used a version for each move of the river and if the document is deleted in the database, the version is lower than in ES, so all documents are deleted. But in my case I only update documents if they have changed, so only those documents that are updated are getting a higher version, and those documents that are not changed in the database, have the same version as before. So if I would also delete documents like it is done in the jdbc river plugin, I also would delete documents that are still available in my database, but not updated because they have not changed in the database. I don't think that this solution is appropriate for my problem.

Does anyone knows if other river plugins are available, that could help me with my problem? Or does anyone know a solution, to synchronize deleted documents in ES?

Thanks for any help.
Hopefully I described my problem in a understandable way.

Best regard,
Isabella

You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ce722fe4-6d17-4b23-9d4c-e4f5f49ed349%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/EA4041A6-5C83-4D40-AF65-436E9717E1CB%40pilato.fr.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #3

What you describe about the JDBC river is correct. However, this will
change in one of the next versions, there will be no more
modification/deletion control by ES external versioning feature.

Instead, I will implement an index alias based time-based indexing scheme,
similar to logstash.

This generational indexing procedure works very simple. At each generation,
the whole index is built again. Not each SQL row ID or doc ID will be
deleted any more because this is too costly. Instead, every once in a
while, like log rotating, an index that is becoming too old can be dropped
automatically by the river switching the alias. The side effect is that
deletions are not visible before a certain delay, presumably hours or days.
From my experience, this is acceptable in most cases.

Another method would be to ingest doc ids of docs to be deleted in a
separate thread. Another SQL statement in the river setup could be executed
to obtain a list of IDs that must be deleted on a regular basis. This could
be run in an extra thread, with the risk of conflicts with the doc ingest
by the river. The list of IDs to be deleted must be managed by the RDBMS.
This is also a disadvantage.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoEy%3DTp%3D__wx2DgKCXMUHMBt-yjSQm9UkmmsEzxdcySFwA%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #4

Interesting.

I guess it worth it as long as you don't have so much data to fetch.

Does it mean that new documents will also appear after this "certain delay"? I guess so but would like to double check.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

Le 5 décembre 2013 at 09:50:42, joergprante@gmail.com (joergprante@gmail.com) a écrit:

What you describe about the JDBC river is correct. However, this will change in one of the next versions, there will be no more modification/deletion control by ES external versioning feature.

Instead, I will implement an index alias based time-based indexing scheme, similar to logstash.

This generational indexing procedure works very simple. At each generation, the whole index is built again. Not each SQL row ID or doc ID will be deleted any more because this is too costly. Instead, every once in a while, like log rotating, an index that is becoming too old can be dropped automatically by the river switching the alias. The side effect is that deletions are not visible before a certain delay, presumably hours or days. From my experience, this is acceptable in most cases.

Another method would be to ingest doc ids of docs to be deleted in a separate thread. Another SQL statement in the river setup could be executed to obtain a list of IDs that must be deleted on a regular basis. This could be run in an extra thread, with the risk of conflicts with the doc ingest by the river. The list of IDs to be deleted must be managed by the RDBMS. This is also a disadvantage.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoEy%3DTp%3D__wx2DgKCXMUHMBt-yjSQm9UkmmsEzxdcySFwA%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/etPan.52a055a8.1381823a.bd3d%40MacBook-Air-de-David.local.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #5

The idea is a sliding window over indexes, I think this is how logstash
setup works.

New documents (and overwritten documents) can be generated at each river
cycle, they appear immediately, in the current index.

If a day is passed, a new index is created for the next day, and started to
get filled by the river. If an index is too old, it is deleted (by
adjusting the index alias).

So the current index filled by the river always grows and never shrinks.
Deletions only take place by dropping a whole index.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoEzX7YWtJr%3DvnvEe64nO3eJ6P5%3Da5ivhmrGBqC0W2iL-A%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #6

I see. Thanks for the clarification Jörg!

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

Le 5 décembre 2013 at 16:53:24, joergprante@gmail.com (joergprante@gmail.com) a écrit:

The idea is a sliding window over indexes, I think this is how logstash setup works.

New documents (and overwritten documents) can be generated at each river cycle, they appear immediately, in the current index.

If a day is passed, a new index is created for the next day, and started to get filled by the river. If an index is too old, it is deleted (by adjusting the index alias).

So the current index filled by the river always grows and never shrinks. Deletions only take place by dropping a whole index.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoEzX7YWtJr%3DvnvEe64nO3eJ6P5%3Da5ivhmrGBqC0W2iL-A%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/etPan.52a0a45f.39ee015c.bd3d%40MacBook-Air-de-David.local.
For more options, visit https://groups.google.com/groups/opt_out.


(Gabe Gorelick-Feldman) #7

I highly recommend the found.no blog post on this subject [1]. It gives a
very good overview of your options for keeping elasticsearch in sync with a
datastore of record. Handling deletions are also addressed.

Since you're using postgres, you can also use LISTEN/NOTIFY [2]. Combined
with triggers, it gives you a very easy way to signal that your data has
changed.

[1] https://www.found.no/foundation/keeping-elasticsearch-in-sync/
[2] http://www.postgresql.org/docs/current/static/sql-notify.html

On Thursday, December 5, 2013 3:22:22 AM UTC-5, Isabella wrote:

Hello everyone!

I am creating an river plugin that synchronizes ES and a postgresql
database. I work with an timestamp to check if my document in the database
has changed, if it has changed it is re-indexed in ES if it has not
changed, my plugin skips the document. My timestamp is a column in my
database and it is also indexed with my document. My problem is, that I
don't know how I can synchronize documents that are deleted in the
database? So my id of my document is no longer available in my database,
but how can I tell ES that it has to delete this document, if I do no
longer have the id to delete it. Does anyone has an idea how I could solve
my problem?

I connect to my database, select all documents out of my database and
iterate through my result set. Then I search for each id in ES and check
timestamp from database with timestamp of indexed document in ES. But if my
document is no longer available in my database, I don't know how to inform
ES that a document has been deleted in my database.

I have looked at the jdbc river plugin, and there is used a version for
each move of the river and if the document is deleted in the database, the
version is lower than in ES, so all documents are deleted. But in my case I
only update documents if they have changed, so only those documents that
are updated are getting a higher version, and those documents that are not
changed in the database, have the same version as before. So if I would
also delete documents like it is done in the jdbc river plugin, I also
would delete documents that are still available in my database, but not
updated because they have not changed in the database. I don't think that
this solution is appropriate for my problem.

Does anyone knows if other river plugins are available, that could help me
with my problem? Or does anyone know a solution, to synchronize deleted
documents in ES?

Thanks for any help.

Hopefully I described my problem in a understandable way.

Best regard,

Isabella

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/681c4276-df1f-45ef-8546-f0f978e0f780%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(system) #8