Hi
I have a file which is rewritten daily with new data.
Everything was working fine until I forgot to update the file with new data.
As a result, Logstash processed the old data again, generating duplicates.
Is there a way I can correct that (remove the duplicate data)?
Is there a way to avoid this issue in the future?
Yes, there is a way.
Have a look at the fingerprint filter and MD5 hash calculation.
You calculate an MD5 hash of your message and then use it as the document ID in the Elasticsearch output when loading the data. Since the same message always produces the same ID, re-processing the file overwrites existing documents instead of creating duplicates.
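A minimal sketch of such a pipeline (the hosts, index name, and source field are placeholders; exact option names can vary between versions of the fingerprint plugin):

```
filter {
  # Hash the whole message; the same line always yields the same fingerprint.
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "MD5"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]               # placeholder
    index       => "my-index"                       # placeholder
    document_id => "%{[@metadata][fingerprint]}"    # duplicate events overwrite, not duplicate
  }
}
```

Storing the hash under `[@metadata]` keeps it out of the indexed document itself.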
If you want to remove the duplicate data that is already indexed (see the sketch after this list):

1. read the data back from Elasticsearch in Logstash (elasticsearch input),
2. in the filter section, calculate an MD5 hash of the fields that should form the unique ID,
3. output to Elasticsearch with the document ID taken from the fingerprint filter.
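A rough sketch of that cleanup pipeline, assuming hypothetical index names `my-index` and `my-index-deduped` and hypothetical fields `field1`/`field2` as the uniqueness key:

```
input {
  elasticsearch {
    hosts => ["localhost:9200"]                       # placeholder
    index => "my-index"                               # index that currently holds duplicates
    query => '{ "query": { "match_all": {} } }'
  }
}

filter {
  # Build one hash over all fields that together identify a unique event.
  fingerprint {
    source              => ["field1", "field2"]       # hypothetical fields
    concatenate_sources => true
    target              => "[@metadata][fingerprint]"
    method              => "MD5"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "my-index-deduped"                 # write the deduplicated copy here
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

Duplicates collapse onto the same document ID, so the target index ends up with a single copy of each event; writing to a new index rather than back into the one you are reading from is the safer choice, and you can delete or alias away the old index afterwards.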