Duplicate IDs across rollover indices

I am ingesting data into a rollover index. The data comes from an API and gives updates on previously ingested events. Each event has a field that is unique and I am using this as the document ID. However, I've come across an issue recently where I have the same event listed multiple times with each of them existing in separate indices. Below is an example where there are three events when it should only be one. I am ingesting the data using Logstash. Why is this happening?

1st Event: [image]

2nd Event: [image]

3rd Event: [image]

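If your events are being updated, using rollover is tricky, as you need to figure out which index an event is in before doing the update. A document ID is only unique within a single index, so writing the same ID through the rollover alias after a rollover creates a new document in the new index instead of overwriting the old one.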
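To make that concrete, here is a toy in-memory model of a rollover alias in Python. It does not talk to Elasticsearch; the class, index names, and event ID are made up for illustration, and the only behavior it assumes is that writes through the alias always land in the newest index.

```python
class TinyCluster:
    """Toy stand-in for a cluster with a single rollover write alias."""

    def __init__(self):
        # index name -> {document ID -> document}
        self.indices = {"events-000001": {}}
        self.write_index = "events-000001"

    def rollover(self):
        # Create the next index and point the write alias at it.
        next_number = len(self.indices) + 1
        self.write_index = f"events-{next_number:06d}"
        self.indices[self.write_index] = {}

    def index_via_alias(self, doc_id, doc):
        # The ID is only checked against the current write index.
        self.indices[self.write_index][doc_id] = doc

    def copies_of(self, doc_id):
        # Every index that holds a document with this ID.
        return [name for name, docs in self.indices.items() if doc_id in docs]


cluster = TinyCluster()
cluster.index_via_alias("evt-1", {"status": "new"})
cluster.rollover()
cluster.index_via_alias("evt-1", {"status": "updated"})
cluster.rollover()
cluster.index_via_alias("evt-1", {"status": "closed"})

print(cluster.copies_of("evt-1"))
# ['events-000001', 'events-000002', 'events-000003']
```

The same ID ends up in three indices, which matches the three copies of one event shown in the question.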

What was the reason for using rollover in this setup? Maybe you can explain a bit more of your rationale here.

We have an undersized environment that we can't right-size, so I was trying to conserve resources and improve performance by moving from daily indices, which were sometimes only a few MB, to rollover based on size.

Some investigation shows that I was sorely mistaken. I assumed fingerprinting would prevent duplication but I just checked an index that ingests millions of events a day with some overlapping events and found duplicates there as well. Looks like I broke the integrity of my data :angry:
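One way to gauge how widespread the duplication is would be a terms aggregation on the unique field with `min_doc_count` set to 2, run against all indices behind the alias. The Python sketch below only builds the request body and reads the buckets back out; the field name `event_id` and the sample response are assumptions for illustration, the field would need to be mapped as a keyword, and terms aggregation counts can be approximate across shards.

```python
def build_duplicate_query(id_field, max_buckets=100):
    """Search body that returns only IDs occurring at least twice."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            "duplicate_ids": {
                "terms": {
                    "field": id_field,
                    "min_doc_count": 2,
                    "size": max_buckets,
                }
            }
        },
    }


def duplicate_ids(response):
    """Map each duplicated ID to the number of documents carrying it."""
    buckets = response["aggregations"]["duplicate_ids"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hypothetical response, shaped like a terms aggregation result.
sample_response = {
    "aggregations": {
        "duplicate_ids": {"buckets": [{"key": "evt-1", "doc_count": 3}]}
    }
}

body = build_duplicate_query("event_id")  # "event_id" is an assumed field name
print(duplicate_ids(sample_response))
# {'evt-1': 3}
```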

Is there a process that needs to be implemented to perform the check? Does it need to happen on the data delivery side or on the Elasticsearch side?

The data is being delivered by Logstash, so is it going to be something like querying Elasticsearch for a matching event? If that's the case, I'm not 100% sure how I would do that. I'm thinking I could pull the previously ingested event's index and use that as the index in my elasticsearch output, but that's probably a question for another forum.

So you have time-based data that needs to be updateable. In that case you need to find out where your data is on the client side of things.

You may want to take a look at the elasticsearch filter plugin for Logstash, but I'd assume this will have quite an impact on your processing speed.
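The client-side lookup could look roughly like the Python sketch below: search the indices behind the alias for the document ID with an `ids` query, reuse the `_index` of the hit if there is one, and fall back to the write alias otherwise. The `search` callable, the `fake_search` stub, and the names `events-write` and `evt-1` are hypothetical; in a real pipeline the lookup would be done by whatever client or filter is available, and the extra query per event is where the processing cost comes from.

```python
from typing import Any, Callable, Dict

SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def build_id_lookup(doc_id: str) -> Dict[str, Any]:
    """Search body that finds a document by ID without fetching its source."""
    return {"size": 1, "_source": False, "query": {"ids": {"values": [doc_id]}}}


def resolve_target_index(search: SearchFn, doc_id: str, write_alias: str) -> str:
    """Return the index that already holds doc_id, else the write alias."""
    response = search(build_id_lookup(doc_id))
    hits = response["hits"]["hits"]
    return hits[0]["_index"] if hits else write_alias


def fake_search(body: Dict[str, Any]) -> Dict[str, Any]:
    """Stub standing in for a real search across all rollover indices."""
    known = {"evt-1": "events-000001"}
    doc_id = body["query"]["ids"]["values"][0]
    index = known.get(doc_id)
    hits = [{"_index": index, "_id": doc_id}] if index else []
    return {"hits": {"hits": hits}}


print(resolve_target_index(fake_search, "evt-1", "events-write"))  # events-000001
print(resolve_target_index(fake_search, "evt-9", "events-write"))  # events-write
```

An update for `evt-1` is routed back to the index that already holds it, while a previously unseen event goes to the current write index.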

I actually already opened a post in the Logstash forum regarding this ("Configuring Pipeline To Handle Duplicates In Rollover Indices"), but I haven't had much info come from it so far.
