I am ingesting data into a rollover index. The data comes from an API and gives updates on previously ingested events. Each event has a unique field, and I am using that field as the document ID. However, I've recently come across an issue where the same event is listed multiple times, with each copy existing in a separate index. Below is an example where the same event appears three times when it should only appear once. I am ingesting the data using Logstash. Why is this happening?
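For reference, the output is set up roughly like this (the `event_id` field and `events-write` alias are placeholders for the real names):

```
output {
  elasticsearch {
    hosts       => ["https://localhost:9200"]
    index       => "events-write"      # the rollover write alias
    document_id => "%{event_id}"       # the unique field from the API
  }
}
```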
We have an undersized environment that we can't right-size, so I was trying to conserve resources and improve performance by moving from daily indices that were sometimes only a few MB to size-based rollover.
Some investigation shows that I was sorely mistaken. I assumed fingerprinting would prevent duplication, but I just checked an index that ingests millions of events a day with some overlapping events and found duplicates there as well. Looks like I broke the integrity of my data.
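The fingerprinting I mentioned is wired up roughly like this (again, the field and alias names are stand-ins for the real ones); the idea was that a repeat of the same event would hash to the same `_id` and overwrite the earlier copy:

```
filter {
  fingerprint {
    source => ["event_id"]                  # placeholder for the unique field(s)
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}
output {
  elasticsearch {
    hosts       => ["https://localhost:9200"]
    index       => "events-write"           # rollover write alias (placeholder)
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```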
Is there a process that needs to be implemented to perform the check, and does it need to happen on the data-delivery side or on the Elasticsearch side?
The data is being delivered by Logstash, so is it going to be something like querying Elasticsearch for a matching event? If that's the case, I'm not 100% sure how I would do that. I'm thinking I would pull the previously ingested event's index and use that as the index in my elasticsearch output, but that's probably a question for another forum.
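To make that last idea concrete, this is the kind of thing I'm picturing with the elasticsearch filter plugin (completely untested, and the field/alias names are placeholders):

```
filter {
  # Look up a previously ingested copy of this event and note which
  # backing index it landed in.
  elasticsearch {
    hosts          => ["https://localhost:9200"]
    index          => "events-*"                          # all backing indices
    query          => "_id:%{event_id}"                   # placeholder ID field
    docinfo_fields => { "_index" => "[@metadata][target_index]" }
  }
  # No match found: fall back to the normal write alias.
  if ![@metadata][target_index] {
    mutate { add_field => { "[@metadata][target_index]" => "events-write" } }
  }
}
output {
  elasticsearch {
    hosts       => ["https://localhost:9200"]
    index       => "%{[@metadata][target_index]}"
    document_id => "%{event_id}"
  }
}
```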