For a while, I've been using the fingerprint filter and setting the resulting value as the document ID in the Elasticsearch output to prevent duplicates. I've recently discovered (the hard way) that this method is not effective when you are using rollover indices. So how can we go about handling de-duplication with rollover indices?
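Roughly, this is the kind of pipeline I mean (simplified and untested; field, key, and index names are just placeholders):

```
filter {
  # Hash a stable set of fields so identical events get identical IDs
  fingerprint {
    source => ["message"]
    method => "SHA256"
    key    => "changeme"                    # HMAC key; any constant string
    target => "[@metadata][fingerprint]"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "my-index"               # with rollover, this is the write alias
    document_id => "%{[@metadata][fingerprint]}"   # duplicates overwrite instead of piling up
  }
}
```

With rollover, the earlier copy of a duplicate may live in an older backing index, so writing the same _id through the alias just creates a second copy in the current backing index.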
I haven't done any testing just yet, but I believe the process below will work (a rough, untested sketch follows the list). Does anybody see problems with this, or know of a better way?
1. Fingerprint the document as usual.
2. Perform an Elasticsearch lookup using the fingerprint as the query and capture the _index field from the result.
3. In the output, add a conditional: if _index is not null, use its value as the index in the Elasticsearch output.
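Here is the untested sketch of what I have in mind, using the elasticsearch filter's docinfo_fields option to pull back the index a match was found in (this assumes a recent enough version of the filter plugin; alias and field names are placeholders):

```
filter {
  fingerprint {
    source => ["message"]
    method => "SHA256"
    key    => "changeme"
    target => "[@metadata][fingerprint]"
  }

  # Look the fingerprint up across every backing index behind the rollover alias
  elasticsearch {
    hosts          => ["localhost:9200"]
    index          => "my-index-*"
    query          => "_id:%{[@metadata][fingerprint]}"
    docinfo_fields => { "_index" => "[@metadata][existing_index]" }
    tag_on_failure => ["_dedup_lookup_failure"]
  }
}

output {
  if [@metadata][existing_index] {
    # Duplicate found: write to the backing index that already holds it
    elasticsearch {
      hosts       => ["localhost:9200"]
      index       => "%{[@metadata][existing_index]}"
      document_id => "%{[@metadata][fingerprint]}"
    }
  } else {
    # New document: write through the rollover write alias as usual
    elasticsearch {
      hosts       => ["localhost:9200"]
      index       => "my-index"
      document_id => "%{[@metadata][fingerprint]}"
    }
  }
}
```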
If you could make all your events contain the initial timestamp, you could use it for the @timestamp field together with standard time-based indices, e.g. monthly. This would send all related events to the same index, which means they would update if you set the ID to a fingerprint. I'm not sure Logstash can do what you proposed, and I suspect it would be very, very slow if you tried.
I agree that it would significantly slow down all aspects of the Elastic Stack. Most of my indices contain a date/time field for when the event occurred, but how would I use that to pick the Elasticsearch index? Can you give me an example?
If I have an event with a timefield of 8/20/2019 15:30:45, how do I get the Elasticsearch output's index option to use just the month/week portion? Is that done with mutate and gsub?
Another problem I see with this: one of my indices grows by about 10 million events per day, and controlling it by size with rollover and ILM has been working well. With this method, that size control is lost. Is there a way to account for something like that?
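On the index-per-period question: the usual approach is the date filter rather than mutate and gsub; parse timefield into @timestamp and let the output's index option do the date math. A rough, untested sketch (assuming the field is literally named timefield and an index prefix of my-index):

```
filter {
  # Parse the original event time so @timestamp reflects when the event occurred
  date {
    match  => ["timefield", "M/d/yyyy HH:mm:ss"]
    target => "@timestamp"
  }

  fingerprint {
    source => ["message"]
    method => "SHA256"
    key    => "changeme"
    target => "[@metadata][fingerprint]"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "my-index-%{+YYYY.MM}"      # monthly; %{+xxxx.ww} gives weekly buckets
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

The trade-off is the one raised above: with explicit date-based index names you give up size-based rollover, so for an index growing by roughly 10 million events a day you would likely want weekly or even daily buckets (plus an ILM delete phase for retention) to keep individual indices at a manageable size.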