Preventing duplicates when reading the same data multiple times

Hi,

I'm wondering if anyone has a good solution to the following.
I have to read in .csv files that are generated every hour. Each .csv file contains all the historical data going back to when the program first ran, so every hour a new file is created that may have new rows at the end but also repeats everything from the very beginning.

We have to send these files via Filebeat -> Logstash -> Elasticsearch.

How do you prevent the index from filling up with duplicates? With the JDBC input plugin there are several options for this, but I don't know of any equivalent for text logs (.csv). The .csv file contains both unique IDs and timestamps that could in theory be used if it worked like the JDBC plugin, but as far as I know it doesn't.

There seem to be duplicate-handling options in Filebeat, but Filebeat doesn't see the .csv fields in our configuration; the .csv parsing happens in Logstash.

How do you solve such an issue? Can you use the IDs or timestamps in the .csv file, or do you just flush the index and re-read everything every time?

Hi,

If your file contains an ID for each entry, the easiest way is to use that ID as the document ID via the elasticsearch output.

Currently, the document ID is auto-generated, so each entry gets a new ID and is therefore always created as a new document in Elasticsearch.
With an explicit document ID, the document is still sent to Elasticsearch, but the ID is used to check whether the document already exists, and the existing document is overwritten instead of a duplicate being inserted.
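For reference, here is a minimal sketch of what that could look like in a Logstash pipeline. The column names ("id", "timestamp", "message"), the index name, and the host are placeholders for illustration, not taken from your setup:

    filter {
      csv {
        separator => ","
        # Placeholder columns; replace with the real header of your .csv
        columns => ["id", "timestamp", "message"]
      }
    }

    output {
      elasticsearch {
        hosts => ["http://localhost:9200"]
        index => "my-csv-index"
        # Use the unique id from the file as the document id, so re-reading
        # the same row overwrites the existing document instead of creating a duplicate
        document_id => "%{id}"
      }
    }

With this, indexing is keyed on the id column, so re-shipping the full file every hour only re-indexes documents that already exist rather than adding duplicates.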

Best regards
Wolfram


Thank you! That did indeed work.
