I am ingesting data into a rollover index. The data comes from an API and gives updates on previously ingested events. Each event has a unique field, and I am using that field as the document ID. However, I've recently come across an issue where the same event is listed multiple times, with each copy existing in a separate index. Below is an example where the same event appears three times when it should only appear once. I am ingesting the data using Logstash. Why is this happening?
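For reference, the output is set up roughly like this (the `event_id` field and `events-write` alias are placeholders for the real names):

```
output {
  elasticsearch {
    hosts       => ["https://localhost:9200"]
    index       => "events-write"      # the rollover write alias
    document_id => "%{event_id}"       # the unique field from the API
  }
}
```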
We have an undersized environment that we can't right-size, so I was trying to conserve resources and improve performance by moving from daily indices that were sometimes only a few MB to size-based rollover.
Some investigation shows that I was sorely mistaken. I assumed fingerprinting would prevent duplication, but I just checked an index that ingests millions of events a day with some overlapping events and found duplicates there as well. Looks like I broke the integrity of my data.
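The fingerprinting I mentioned is wired up roughly like this (again, the field and alias names are stand-ins for the real ones); the idea was that a repeat of the same event would hash to the same `_id` and overwrite the earlier copy:

```
filter {
  fingerprint {
    source => ["event_id"]                  # placeholder for the unique field(s)
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}
output {
  elasticsearch {
    hosts       => ["https://localhost:9200"]
    index       => "events-write"           # rollover write alias (placeholder)
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```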
Is there a process that needs to be implemented to perform the check, and does it need to happen on the data-delivery side or on the Elasticsearch side?
The data is being delivered by Logstash, so is it going to be something like querying Elasticsearch for a matching event? If that's the case, I'm not 100% sure how I would do that. I'm thinking I would pull the previously ingested event's index and use that as the index in my elasticsearch output, but that's probably a question for another forum.
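To make that last idea concrete, this is the kind of thing I'm picturing with the elasticsearch filter plugin (completely untested, and the field/alias names are placeholders):

```
filter {
  # Look up a previously ingested copy of this event and note which
  # backing index it landed in.
  elasticsearch {
    hosts          => ["https://localhost:9200"]
    index          => "events-*"                          # all backing indices
    query          => "_id:%{event_id}"                   # placeholder ID field
    docinfo_fields => { "_index" => "[@metadata][target_index]" }
  }
  # No match found: fall back to the normal write alias.
  if ![@metadata][target_index] {
    mutate { add_field => { "[@metadata][target_index]" => "events-write" } }
  }
}
output {
  elasticsearch {
    hosts       => ["https://localhost:9200"]
    index       => "%{[@metadata][target_index]}"
    document_id => "%{event_id}"
  }
}
```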