Remove/Prevent duplicates with rollover


One of my indices has an ILM (index lifecycle management) policy with rollover.

The problem is that when the index receives data and rolls over at the same time, the latest data are duplicated (present in both the old and new index).

How can I prevent or remove duplicates after a rollover?

Thanks for your time!

Are you indexing immutable data or also performing updates?

Which version of Elasticsearch are you using?

How are you indexing data into Elasticsearch?

The data come from a Transform, but they are not updated (I have a bucket selector that puts the Transform on hold if the data is not complete). The duplicated rows are identical in both the previous and current indices.

Elasticsearch version 7.17.3

Filebeat > Elasticsearch ingest node > Transform > ILM

So the target index the transform writes to is using rollover and is managed by ILM? If that is the case, I believe this is not supported.

So how can I store the result of my Transforms?

If I need the data for 3 months, is the only way to keep a giant hot index with everything in it?

Yes, I believe you would need to use a single destination index and clear data using delete by query.

What is delete by query?

And is there an "add by query" where I can shift the old data to an ILM-managed index?

That would probably require you to reindex data no longer being updated before deleting it from the original index. It may be difficult to get this consistent, though.

Would a query like "reindex data between the day before yesterday at 00:00:00 and yesterday at 00:00:00" followed by "delete that data in the same date range" not be consistent?

The two operations will take some time to run, and during that time you would have duplicates, and there could be failures that need to be handled. This also assumes that you do not have any data arriving late that updates documents that have already been transferred, and I suspect this depends on the nature and logic of the transform.
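For illustration, the two-step move being discussed could be sketched in Dev Tools roughly like this (the index names `transform-dest` and `transform-archive-000001` are hypothetical, and the caveats about failures and late-arriving data still apply):

```
# Step 1: copy one fixed day from the transform index into a time-based index
POST _reindex
{
  "source": {
    "index": "transform-dest",
    "query": {
      "range": {
        "@timestamp": { "gte": "now-2d/d", "lt": "now-1d/d" }
      }
    }
  },
  "dest": { "index": "transform-archive-000001" }
}

# Step 2: once the reindex has succeeded, remove the same range from the source
POST transform-dest/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": { "gte": "now-2d/d", "lt": "now-1d/d" }
    }
  }
}
```

Between step 1 and step 2 the documents exist in both indices, which is the consistency window mentioned above.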

OK, thanks for the advice.
I will run some tests to find the best solution for my case.

Have a nice day!

If this solution requires you to perform a reindex as well as a delete by query for every document, would it not be better to have a single transform index with a larger number of primary shards and delete documents with delete by query once they no longer need to be retained? It would be simpler and also put less load on the cluster.
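A sketch of that approach, with a hypothetical index name `transform-dest` and an illustrative 90-day retention window: create the single destination index with more primary shards up front, then periodically clear expired documents.

```
# Create the destination index with more primary shards (counts are illustrative)
PUT transform-dest
{
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 1
  }
}

# Run periodically (e.g. daily) to remove documents past the retention window
POST transform-dest/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": { "lt": "now-90d/d" }
    }
  }
}
```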

But what if for some reason I need to keep the data for 6 months, or a year?
Or what if I just want to change the template because I have a new field or a new way to group the data?

I need to keep it separated.

Storing data in 10 indices with 1 primary shard each or in 1 single index with 10 primary shards is basically the same. Having only one index is simpler and requires a lot less work and load on the cluster.

I am not sure how you handle changes to transforms, but I do not see how switching to a different time-based index is any different from starting to use a new single index. You can still query the old and new indices together through an index pattern or alias, and you would have the same issues.

And what about the date index name processor? If I change the destination from an index to an ingest pipeline, does it consume a lot more resources?
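(For context, the feature referred to here is the `date_index_name` ingest processor, which routes each document to a time-based index derived from a date field. A minimal pipeline sketch, with hypothetical names and monthly rounding:)

```
PUT _ingest/pipeline/transform-date-routing
{
  "processors": [
    {
      "date_index_name": {
        "field": "@timestamp",
        "index_name_prefix": "transform-dest-",
        "date_rounding": "M"
      }
    }
  ]
}
```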

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.