Hi,
I'm wondering if anyone has a good solution to the following.
I have to read in .csv files that are generated every hour. Each .csv file contains all the historic data going back to when the program first started running. Every hour the program runs again and writes a new file that may have new rows at the end, but also repeats everything from the very beginning.
We have to send these files via Filebeat -> Logstash -> Elasticsearch.
How do you prevent the index from filling up with duplicates? With the JDBC input plugin there are several options for this, but I don't know of any equivalent for text logs (.csv). The .csv file contains both unique IDs and timestamps that could in theory be used the way the JDBC plugin uses them, but as far as I know that isn't supported for file input.
There seem to be duplicate-handling options in Filebeat, but in our setup Filebeat never sees the parsed .csv fields; the .csv parsing happens in Logstash.
How do you solve an issue like this? Can the IDs or timestamps in the .csv file be used somehow, or is the only real option to flush the index and reread the whole file every time?
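For reference, here is roughly what I had in mind on the Logstash side. It's only a minimal sketch with made-up column names (id, timestamp, value) and a local Elasticsearch, where the unique id from the .csv is reused as the document id so that re-reading an old row would overwrite the existing document instead of creating a duplicate. I have no idea if this is the recommended way to handle it:

```
input {
  # Filebeat ships the raw .csv lines here
  beats {
    port => 5044
  }
}

filter {
  # Parse each line into named fields (column names are made up for this example)
  csv {
    separator => ","
    columns => ["id", "timestamp", "value"]
  }

  # Use the timestamp column from the file as @timestamp (format is a guess)
  date {
    match => ["timestamp", "ISO8601"]
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "mydata"
    # Reusing the unique id from the .csv as the document id means that
    # re-reading an old row overwrites the existing document instead of
    # adding a duplicate.
    document_id => "%{id}"
  }
}
```

If the id column alone turns out not to be usable, I'm guessing the fingerprint filter could hash a combination of fields into a document id instead, but that's just a guess on my part.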