I am fetching records from a database and using the document id as the primary key (_id).
I have created date-based indices, so a new index is created every day,
like test-2019.09.23, test-2019.09.22.
This Logstash job runs every 5 minutes; that is a requirement.
If a record is updated within the same day, Elasticsearch handles it properly, but if the record is updated on another day it is treated as a new document in that day's new index.
This creates duplicate records when searching and fetching.
I want to keep only one copy (the latest) of the same document across indices.
How can I handle this?
Any configuration suggestions or a sample example would be helpful.
Unfortunately, there is no simple out-of-the-box approach to handling duplicates across multiple indices.
The solution will have to be a separate de-duplication process.
One option: if you have a last_update date field that is set whenever a record is updated, you can use that information to handle updates separately from new records. Search all current and old indices through an alias to check whether the _id already exists, and upsert into that index. This comes with a performance hit, since you will be searching before every insert.
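As a minimal sketch of that search-then-upsert idea, using the elasticsearch Python client (8.x): it assumes a test-* pattern (or an alias) covering all daily indices, that the document id is available as doc_id, and a localhost cluster. Index names, field names, and the client version are assumptions, not part of your setup.

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint


def index_or_update(doc_id: str, record: dict) -> None:
    """Upsert into the index that already holds this _id, if any;
    otherwise write into today's daily index."""
    # Look the _id up across all daily indices (alias or wildcard pattern).
    result = es.search(
        index="test-*",                      # assumed index pattern / alias
        query={"ids": {"values": [doc_id]}},
        size=1,
    )
    hits = result["hits"]["hits"]

    if hits:
        # Document already exists somewhere: update it in place there.
        target_index = hits[0]["_index"]
    else:
        # New document: write it into today's index.
        target_index = "test-" + datetime.now(timezone.utc).strftime("%Y.%m.%d")

    es.update(
        index=target_index,
        id=doc_id,
        doc=record,
        doc_as_upsert=True,  # create the document if it does not exist yet
    )
```

The same lookup-then-upsert logic can also be done inside the Logstash pipeline itself (an elasticsearch filter to look up the _id, then a conditional elasticsearch output), if you prefer to keep it out of application code.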
Or you can keep creating duplicates and run a de-duplication job at a regular interval that identifies duplicates, deletes the older record (the original), and leaves the newer (updated) one; a sketch of such a job follows below. With this strategy your indices will contain duplicates for the window between when an updated record is written and when the de-dupe job next runs.
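Here is a sketch of such a de-dupe job, again with the elasticsearch Python client. It assumes the primary key is also stored in a regular keyword field (called record_id here, since _id itself cannot be aggregated on) and that last_update is a date field; both field names are assumptions for illustration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint


def dedupe(index_pattern: str = "test-*") -> None:
    """Find record_id values that appear more than once across the daily
    indices and delete every copy except the most recently updated one."""
    result = es.search(
        index=index_pattern,
        size=0,
        aggs={
            "dupes": {
                # Group by the application-level primary key; only buckets
                # with at least two copies are duplicates.
                "terms": {"field": "record_id", "min_doc_count": 2, "size": 1000},
                "aggs": {
                    "copies": {
                        # Newest copy first, so copies[0] is the one to keep.
                        "top_hits": {
                            "sort": [{"last_update": {"order": "desc"}}],
                            "_source": False,
                            "size": 10,
                        }
                    }
                },
            }
        },
    )

    for bucket in result["aggregations"]["dupes"]["buckets"]:
        copies = bucket["copies"]["hits"]["hits"]
        # Keep the newest copy, delete the older duplicates.
        for old in copies[1:]:
            es.delete(index=old["_index"], id=old["_id"])
```

You would schedule this with cron (or similar) at whatever interval bounds how long duplicates are acceptable in search results.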