We are preparing a pipeline that is responsible for capturing the 'current state' of the data in our source table.
Say the source table contains 10 records. We would like to reflect those 10 records in Elasticsearch, using Logstash to pull them every 10 minutes. During the day the number of records might change, for example when someone is deleted. So once Logstash runs and pulls the remaining 9 records, we would like that reflected in Elasticsearch as an index with 9 documents. We don't need the old documents, as we only want to see the 'current state'. We've been thinking about a mechanism that truncates/deletes the index before new data is pushed, but I'm not sure how we could achieve that using only Logstash and Elasticsearch while making sure that data is always present in the index.
Is that achievable in an automatic manner?
Does anything come to mind? I know that for updating existing records we can use upsert, but how about deleting documents that no longer exist in the source table? Is that possible using Logstash?
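To make the goal concrete, here is a minimal sketch of what "documents that no longer exist in the source table" means in terms of IDs. It assumes every source record has a stable primary key that is also used as the Elasticsearch document `_id`; the function name and the sample IDs are hypothetical and only for illustration.

```python
def stale_ids(source_ids, indexed_ids):
    """Return the IDs that are in the index but no longer in the source table."""
    # Assumption: source primary keys are reused as Elasticsearch _id values.
    return sorted(set(indexed_ids) - set(source_ids))


# The index still holds 10 documents, but record "7" was deleted at the source.
indexed_ids = [str(i) for i in range(1, 11)]
source_ids = [i for i in indexed_ids if i != "7"]

print(stale_ids(source_ids, indexed_ids))
```

Anything this returns would have to be removed from the index for it to match the 'current state'.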
Read this and specifically the "Also be patient" part.
It's fine to answer on your own thread after 2 or 3 days (not including weekends) if you don't have an answer.
We are not all guys, fortunately. I think that "Hi!" is perfectly enough.
Not really. You can consider different options:
Use a technical temporary table of deleted items. Read that table and delete every document which is referenced in it.
Use a trigger
Modify the application layer (the service layer) and do that in real time. That's my preferred way.
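As a rough illustration of the first option, the sketch below turns a list of IDs read from such a table of deleted items into a request body for the Elasticsearch `_bulk` API, which accepts newline-delimited JSON with one `delete` action per line. The index name `people`, the helper name, and the sample IDs are assumptions made for this example, and reading the table and sending the request are left out.

```python
import json


def build_bulk_delete_body(index_name, deleted_ids):
    """Build an NDJSON body with one bulk 'delete' action per ID."""
    # Assumption: the IDs in the deleted-items table match the document _id values.
    lines = [
        json.dumps({"delete": {"_index": index_name, "_id": str(doc_id)}})
        for doc_id in deleted_ids
    ]
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical rows read from the deleted-items table.
body = build_bulk_delete_body("people", [3, 7])
print(body, end="")
```

The resulting string would be sent as a POST to the `_bulk` endpoint with the `Content-Type: application/x-ndjson` header.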
Noted. Appreciate it. The rush comes from the fact that we are currently doing a PoC with ELK as a possible data platform, so the more information we get, the faster we can reach the evaluation step. Apologies for that.
As for the proposal, unfortunately the source database cannot be changed in any way. We need to rely on it in its current form.
I'm not yet familiar with triggers, so I'll sink my teeth into the documentation.