Preventing duplicates when reading the same data multiple times

Hi,

I'm wondering if anyone has a good solution to the following.
I have to read in .csv files that are generated every hour. Each .csv file contains all the historical data going back to when the program first ran, so every hour a new file is created that may have new rows at the end but also repeats everything from the very beginning.

We have to send these files via Filebeat -> Logstash -> Elasticsearch.

How do you prevent the index from filling up with duplicates? With the JDBC input plugin there are several options for this, but I don't know of any equivalent for text logs (.csv). The .csv file contains both unique IDs and timestamps that could in theory be used if it worked like the JDBC plugin, but as far as I know it doesn't.

There seem to be duplicate-handling options in Filebeat, but Filebeat doesn't see the .csv fields in our configuration; the .csv parsing happens in Logstash.

How do you solve such an issue? Can you use the IDs or timestamps in the .csv file, or do you just flush the index and re-read everything every time?

Hi,

If your file contains an ID for each entry, the easiest way is to use that ID as the document ID via the elasticsearch output.

Currently, the document ID is auto-generated, so each entry gets a new ID and is therefore always created as a new document in Elasticsearch.
With an explicit document ID, the document is still sent to Elasticsearch, but the ID is used to check whether the document already exists, and the existing document is overwritten instead of a duplicate being inserted.
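For reference, here is a minimal sketch of what that could look like in a Logstash pipeline. The column names ("id", "timestamp", "message"), the index name, and the host are placeholders for illustration, not taken from your setup:

    filter {
      csv {
        separator => ","
        # Placeholder columns; replace with the real header of your .csv
        columns => ["id", "timestamp", "message"]
      }
    }

    output {
      elasticsearch {
        hosts => ["http://localhost:9200"]
        index => "my-csv-index"
        # Use the unique id from the file as the document id, so re-reading
        # the same row overwrites the existing document instead of creating a duplicate
        document_id => "%{id}"
      }
    }

With this, indexing is keyed on the id column, so re-shipping the full file every hour only re-indexes documents that already exist rather than adding duplicates.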

Best regards
Wolfram


Thank you! That did indeed work.
