Configuring Pipeline To Handle Duplicates In Rollover Indices

For a while I've been preventing duplicates by using the fingerprint filter and setting the resulting value as the document ID in the Elasticsearch output. I recently discovered (the hard way) that this method breaks down with rollover indices: document IDs are only unique within a single index, so a duplicate that arrives after a rollover simply gets indexed into the new backing index. So how can we go about handling de-duplication with rollover indices?
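For context, my current setup looks roughly like this (the source fields, hosts, and index name are placeholders for my actual values):

```
filter {
  # Hash the fields that define a "duplicate" into a reusable ID.
  fingerprint {
    source => ["message", "host"]        # placeholder fields
    target => "[@metadata][fingerprint]"
    method => "SHA256"
    concatenate_sources => true
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "myindex"                   # with rollover, this is a write alias
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```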

I haven't done any testing just yet, but I believe the process below will work; a rough sketch follows the list. Does anybody see problems with this, or know of a better way?

  1. Fingerprint the document as usual.
  2. Perform an Elasticsearch lookup using the fingerprint as the query and capture the _index field from the result.
  3. In the output, add a conditional: if _index is not null, use its value as the index option of the Elasticsearch output.
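Here's an untested sketch of what I have in mind. The field names, hosts, and index names are placeholders, and I'm assuming the elasticsearch filter's docinfo_fields option is available to copy the matching document's _index (newer versions of the filter have it):

```
filter {
  fingerprint {
    source => ["message", "host"]
    target => "[@metadata][fingerprint]"
    method => "SHA256"
    concatenate_sources => true
  }

  # Step 2: look up the fingerprint across all rolled-over backing indices.
  # Documents were written with the fingerprint as _id, so query on _id.
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "myindex-*"
    query => "_id:%{[@metadata][fingerprint]}"
    # Copy the matching document's metadata into the event;
    # assumes the installed filter version supports docinfo_fields.
    docinfo_fields => { "_index" => "[@metadata][existing_index]" }
  }
}

output {
  # Step 3: if a copy already exists, overwrite it in the index where it
  # lives; otherwise write through the rollover alias as usual.
  if [@metadata][existing_index] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "%{[@metadata][existing_index]}"
      document_id => "%{[@metadata][fingerprint]}"
    }
  } else {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "myindex"               # write alias
      document_id => "%{[@metadata][fingerprint]}"
    }
  }
}
```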

If you could make all your events contain the initial timestamp, you could use it as the @timestamp field together with standard time-based indices, e.g. monthly. That would send all related events to the same index, which means a write with the fingerprint as the document ID would update the existing document. I'm not sure Logstash can do what you proposed, and I suspect it would be very, very slow if it can.
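Roughly like this, assuming the fingerprint already sits in [@metadata][fingerprint] and "logs" stands in for your index name:

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # Monthly index name derived from @timestamp. All events sharing a
    # fingerprint land in the same index, so a later write updates the
    # earlier document instead of duplicating it.
    index => "logs-%{+yyyy.MM}"
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```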

I agree that it would significantly slow down the whole Elastic Stack. Most of my indices contain a date/time field for when the event occurred, rather than using @timestamp. How would I use that field in the Elasticsearch output's index option? Can you give me an example?

If I have an event with a timefield of 8/20/2019 15:30:45, how do I get the Elasticsearch output's index option to use just the monthly/weekly part? With mutate and gsub?
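The closest I've come up with (untested) is to skip mutate/gsub and let the date filter parse timefield into @timestamp, which the index pattern then reads:

```
filter {
  # Parse "8/20/2019 15:30:45" into @timestamp (Joda-style pattern).
  date {
    match => ["timefield", "M/d/yyyy HH:mm:ss"]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "myindex-%{+yyyy.MM}"     # or "myindex-%{+xxxx.ww}" for weekly
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```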

Another problem I see with this: one of my indices grows by about 10 million events per day, and controlling its size with rollover and ILM has been working well. With this method, that size control is lost. Is there a way to account for something like that?
