Remove/Prevent duplicates with rollover

Hi,

One of my indices has an ILM (index lifecycle management) policy with rollovers.

The problem is that when the index receives data and rolls over at the same time, the latest data is duplicated (present in both the new and the old index).

How can I prevent or remove duplicates after a rollover?

Thanks for your time!

Are you indexing immutable data or also performing updates?

Which version of Elasticsearch are you using?

How are you indexing data into Elasticsearch?

The data comes from a Transform, but it is not updated (I have a bucket selector that puts the Transform on hold while the data is not complete). The duplicated rows are identical in both (previous and current) indices.

Elasticsearch version 7.17.3

Filebeat > Elasticsearch ingest node > Transform > ILM

So the target index the transform writes to is using rollover and is managed by ILM? If that is the case I believe this is not supported.

So how can I store the result of my Transforms?

If I need data for 3 months, then the only way is to keep a giant hot index with everything in it?

Yes, I believe you would need to use a single destination index and clear data using delete by query.
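For reference, a minimal sketch of what such a cleanup request could look like. It only builds the request for the `_delete_by_query` API and does not send it. The index name `transform-dest`, the field `@timestamp` and the 90-day retention are assumptions for illustration, not values from this thread.

```python
import json


def delete_older_than(index, timestamp_field, max_age):
    """Build a _delete_by_query request that removes documents older than max_age.

    max_age uses Elasticsearch date math units, e.g. "90d".
    """
    path = f"/{index}/_delete_by_query"
    body = {
        "query": {
            "range": {
                # Everything strictly older than now minus max_age is deleted.
                timestamp_field: {"lt": f"now-{max_age}"}
            }
        }
    }
    return "POST", path, body


# Hypothetical names: adjust to the real destination index and date field.
method, path, body = delete_older_than("transform-dest", "@timestamp", "90d")
print(method, path)
print(json.dumps(body, indent=2))
```

Running this on a schedule against the single destination index would play the role that the ILM delete phase plays for rolled-over indices.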

What is delete by query?

And is there an "add by query" that I could use to move the old data to an ILM-managed index?

That would probably require you to reindex data no longer being updated before deleting it from the original index. It may be difficult to get this consistent though.

Would a query such as "reindex data between the day before yesterday at 00:00:00 and yesterday at 00:00:00", followed by "delete the data in the same date range", not be consistent?

The two operations will take some time to run and during that time you would have duplicates, and there could be failures that need to be handled. This also assumes that you do not have any data coming in late updating any of the documents that have been transferred, and I suspect this depends on the nature and logic of the transform.
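To make the trade-off concrete, here is a rough sketch of the two steps, assuming a `@timestamp` date field and hypothetical index names `transform-dest` and `transform-archive`. It computes the window once in the client so that both requests use exactly the same bounds, and it only allows the delete when the reindex response reports a complete copy. It does not remove the period during which the data exists in both indices, and it does not handle late updates.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FIELD = "@timestamp"  # assumption: the date field in the transform output


def closed_day_window(now):
    """Range query for [day before yesterday 00:00, yesterday 00:00) in UTC."""
    midnight = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    start = midnight - timedelta(days=2)
    end = midnight - timedelta(days=1)
    return {
        "range": {
            TIMESTAMP_FIELD: {"gte": start.isoformat(), "lt": end.isoformat()}
        }
    }


def reindex_request(source, dest, query):
    """Build the _reindex request that copies the window to the archive index."""
    body = {
        "source": {"index": source, "query": query},
        # op_type "create": a document that already exists in dest causes a
        # version conflict instead of being silently overwritten.
        "dest": {"index": dest, "op_type": "create"},
    }
    return "POST", "/_reindex", body


def delete_request(source, query):
    """Build the _delete_by_query request for the same window."""
    return "POST", f"/{source}/_delete_by_query", {"query": query}


def safe_to_delete(reindex_response):
    """Allow the delete only if every matched document was created in dest."""
    total = reindex_response.get("total")
    return (
        reindex_response.get("timed_out") is False
        and not reindex_response.get("failures")
        and total is not None
        and reindex_response.get("created") == total
    )


window = closed_day_window(datetime.now(timezone.utc))
copy_step = reindex_request("transform-dest", "transform-archive", window)
delete_step = delete_request("transform-dest", window)
```

The delete step would only be sent after `safe_to_delete` returns true for the parsed reindex response; any other outcome is one of the failures that needs to be handled by hand or retried.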

OK, thanks for the advice.
I will run some tests to find the best solution for my case.

Have a nice day!

If this solution requires you to perform a reindex as well as a delete by query for every document, would it not be better to have a single transform index with a larger number of primary shards and remove documents with delete by query once they no longer need to be retained? It would be simpler and would also put less load on the cluster.

But what if, for some reason, I need to keep the data for 6 months, or a year?
Or what if I just want to change the template because I have a new field or a new way to group the data?

I need to keep it separated.

Storing data in 10 indices with 1 primary shard each or in 1 single index with 10 primary shards is basically the same. Having only one index is simpler and requires a lot less work and load on the cluster.
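As an illustration, the single index would be created up front with the desired shard count, since the number of primary shards is fixed at index creation. This sketch only builds the create-index request; the index name `transform-dest` and the replica count are assumptions.

```python
def create_index_request(index, primary_shards, replicas=1):
    """Build a create-index request with an explicit primary shard count."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }
    return "PUT", f"/{index}", body


# One index with 10 primary shards instead of 10 single-shard indices.
request = create_index_request("transform-dest", primary_shards=10)
```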

I am not sure how you handle changes to transforms, but I do not see how switching to a different time-based index is any different from starting to use a new single index. You can still query the old and the new single index through an index pattern or alias, and you would have the same issues.
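A small sketch of the alias option, assuming hypothetical index names `transform-dest-v1` and `transform-dest-v2` for the old and new destination and an alias called `transform-results`. It builds one `_aliases` request that adds both indices, so searches against the alias cover both.

```python
def alias_request(alias, indices):
    """Build an _aliases request that points one alias at several indices."""
    actions = [{"add": {"index": name, "alias": alias}} for name in indices]
    return "POST", "/_aliases", {"actions": actions}


method, path, body = alias_request(
    "transform-results", ["transform-dest-v1", "transform-dest-v2"]
)
# A search would then target /transform-results/_search.
```

An index pattern such as `transform-dest-*` in the search path would give the same coverage without defining an alias.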

And what about the Date index name processor? If I change the destination from a plain index to one that goes through an ingest pipeline, does it consume a lot more resources?
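For anyone reading along, this is roughly the setup being asked about: an ingest pipeline with a `date_index_name` processor, referenced from the `dest` section of the transform. It is only a sketch of the configuration, with a hypothetical pipeline id, field and index prefix. Whether it behaves well with a transform, and what it costs in resources, is not established in this thread.

```python
VALID_ROUNDINGS = {"y", "M", "w", "d", "h", "m", "s"}


def date_index_pipeline(field, prefix, rounding):
    """Build an ingest pipeline body that routes documents by date."""
    if rounding not in VALID_ROUNDINGS:
        raise ValueError(f"unsupported date_rounding: {rounding!r}")
    return {
        "description": "Route documents to a date-based index",
        "processors": [
            {
                "date_index_name": {
                    "field": field,
                    "index_name_prefix": prefix,
                    "date_rounding": rounding,
                }
            }
        ],
    }


def transform_dest(index, pipeline_id):
    """Build the dest section of a transform that writes through a pipeline."""
    return {"index": index, "pipeline": pipeline_id}


# Hypothetical names; the pipeline body would be sent with
# PUT /_ingest/pipeline/monthly-routing.
pipeline = date_index_pipeline("@timestamp", "transform-dest-", "M")
dest = transform_dest("transform-dest", "monthly-routing")
```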
