I have duplicate entries in my Elasticsearch indices.
I have used document_id, which prevented duplicates from appearing. The issue with this is that it overwrites and updates on the duplicate, effectively removing the older copy, which is the 'correct' one.
Can anyone point out if there is any way to prevent the overwriting/updating from happening, and instead tell Elasticsearch to ignore the duplicate that was detected via document_id?
Apologies, I forgot to mention how I am using the data.
Yes, I am ingesting it via Logstash:
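A minimal sketch of what my pipeline looks like (the input source, index name, and fingerprint field here are placeholders, not my exact config):

```
input {
  # Placeholder input; my real pipeline reads from a different source.
  file {
    path => "/var/log/app/*.log"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my-index"
    # Setting document_id makes Elasticsearch treat events with the same
    # ID as the same document, so a duplicate overwrites the stored copy.
    document_id => "%{fingerprint}"
  }
}
```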
From the example you showed me, all I need to do is add the action => "create" line to my conf file.
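Roughly like this, if I understood the suggestion correctly (hosts, index name, and fingerprint field are placeholders again):

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my-index"
    document_id => "%{fingerprint}"
    # "create" refuses to index a document whose ID already exists,
    # so the original (older) copy is kept instead of being overwritten.
    action => "create"
  }
}
```

If I understand the behaviour correctly, the duplicate then shows up as a 409 version-conflict error in the Logstash log instead of replacing the stored document.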
After that, Logstash will not index and overwrite documents that already have an ID: if I simulate a duplicate this time round, nothing happens (unlike before, when documents were overwritten).