If the source DB has a document id that you can use throughout the system (e.g. [our_internal_id]) then that makes things easier.
One approach would be to re-fetch the entire source DB with logstash and write it to a new index, then point an active alias to the new index and delete the old one (outside of logstash).
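The alias swap itself can be done with a single `_aliases` call so searches never see a half-built index. A minimal sketch of that step, assuming the elasticsearch-py 7.x client; the index and alias names are placeholders, and Logstash is assumed to have already finished writing the re-fetched data into the new index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

old_index = "myview-2024-06-01"   # index the alias currently points at
new_index = "myview-2024-06-02"   # index Logstash just finished writing

# Atomically move the alias from the old index to the new one, so queries
# against "myview" switch over in one step.
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": old_index, "alias": "myview"}},
        {"add":    {"index": new_index, "alias": "myview"}},
    ]
})

# Drop the old index once the alias points at the new one.
es.indices.delete(index=old_index)
```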
If the source DB supports triggers then you may be able to send a record to logstash to tell it that a DB row has been deleted so that LS can delete it from ES.
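In Logstash terms that would be a jdbc input reading a "tombstone" table that the trigger writes to, feeding an elasticsearch output with `action => "delete"`. Here is a rough equivalent in plain Python to show the idea; it assumes a PostgreSQL source, a hypothetical `deleted_rows` tombstone table, an `our_internal_id` column used as the ES document id, and the elasticsearch-py 7.x and psycopg2 libraries:

```python
from elasticsearch import Elasticsearch
import psycopg2

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=source user=etl")

with conn, conn.cursor() as cur:
    # Pull the ids that the DB trigger recorded when rows were deleted.
    cur.execute("SELECT our_internal_id FROM deleted_rows")
    for (doc_id,) in cur.fetchall():
        # Remove the matching document from ES; ignore ids that are already gone.
        es.delete(index="myview", id=doc_id, ignore=[404])
        # Clear the tombstone so it is not replayed on the next run.
        cur.execute("DELETE FROM deleted_rows WHERE our_internal_id = %s", (doc_id,))
```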
You might be able to do it by periodically fetching all the ids from the destination (the ES index) and testing whether each one still exists in the source; any that do not can be deleted from the destination.
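A sketch of that reconciliation pass, under the same assumptions as above (elasticsearch-py 7.x, psycopg2, and made-up `myview` / `source_view` / `our_internal_id` names):

```python
from elasticsearch import Elasticsearch, helpers
import psycopg2

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=source user=etl")

# Ids currently indexed in ES (scan avoids the 10k search result limit).
es_ids = {hit["_id"] for hit in helpers.scan(es, index="myview", _source=False)}

# Ids currently present in the source view.
with conn, conn.cursor() as cur:
    cur.execute("SELECT our_internal_id FROM source_view")
    source_ids = {str(row[0]) for row in cur.fetchall()}

# Anything in ES but no longer in the source has been deleted upstream.
stale = es_ids - source_ids
helpers.bulk(es, (
    {"_op_type": "delete", "_index": "myview", "_id": doc_id}
    for doc_id in stale
))
```

For a view of around 10k rows this comparison is cheap enough to run on every polling cycle.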
Your use case probably has constraints on how quickly new documents must be indexed and how quickly deleted documents must be removed, and the right choice also depends on how rapidly the source data changes.
If the source supports triggers for updates, this could be quite efficient; if the source does not change very often but you have to pull the entire DB over and over again to detect changes, it is going to be expensive.
If the source DB does not have a document id field, you can add one that defaults to NULL and populate it in the source as each record is added to the destination.
There are certainly other high-level flows besides those mentioned above that could implement this.
Thanks for the reply. The data we're consuming is just a SQL view that we check every few hours and it only has about 10k rows in it, so it's not that expensive to load, imo. I don't have much control over the backend, so the trigger option won't work.
In the meantime, I've also found some other posts, including one that apparently uses an index template to accomplish this, but it gives no details.