Hello,
Is there a way to prevent duplicates in a data stream?
For a regular index, specifying the _id guarantees that there will be no duplicate documents with the same _id.
For data streams, however, this apparently does not work.
We have a (homemade) data collector that was launched twice. Because our data stream had rolled over in between, the same data was inserted into two different backing indices, so it is now present twice in our data stream, which is a huge problem for us.
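For context, here is a minimal sketch of the behaviour described above, assuming the Python elasticsearch client (7.x-style body= argument) and a hypothetical local cluster; the index names and _id scheme are made up for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster address

doc = {"@timestamp": "2021-09-02T10:00:00Z", "value": 42}

# On a regular index, indexing the same _id twice simply overwrites the first
# copy (the second call bumps _version), so no duplicate can exist.
es.index(index="my-index", id="collector-1:2021-09-02T10:00:00Z", body=doc)
es.index(index="my-index", id="collector-1:2021-09-02T10:00:00Z", body=doc)

# A data stream only accepts op_type=create, and (as far as I understand) the _id
# is only checked against the current write backing index. After a rollover the
# same _id can therefore be accepted again in a newer backing index, which is the
# duplication we are seeing.
es.index(index="my-data-stream", id="collector-1:2021-09-02T10:00:00Z",
         op_type="create", body=doc)
```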
Thank you for your response and the reading you suggested.
So it seems we have two solutions here:
Deduplicate the data as described in the link provided. However, in our use case we have billions of documents, so I am not sure about performance. But maybe that is the "big data way" of doing things, I don't know.
"pre-allocate document to correct index" Implement a similar process of data stream on our side. We define a pattern naming convention like "my-stream-2021-09-02". For each time window that we define (say 30 days), we create a new index from the client side, following the convention.
Now each time we want to bulk new documents, we take min and max @timestamp of this bulk, we create corresponding indices if they do not exist. Finally, we bulk data and for each one extract from its timestamp the unique corresponding indice to insert to. This way, we have the guarantee that we won't have duplicates.
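A rough sketch of that client-side routing, again assuming the Python elasticsearch client and a hypothetical cluster; monthly_index() and the unique_key field are made-up names for whichever naming convention and natural key are chosen:

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster address

def monthly_index(ts: str) -> str:
    """Map a document @timestamp to its fixed time-based index, e.g. my-stream-2021-09."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
    return f"my-stream-{dt:%Y-%m}"

def ensure_index(name: str) -> None:
    """Create the target index up front (mappings would normally come from an index template)."""
    if not es.indices.exists(index=name):
        es.indices.create(index=name)

def bulk_without_duplicates(docs):
    """Route each document to the index derived from its own timestamp and use a
    deterministic _id, so re-running the collector overwrites instead of duplicating."""
    for name in {monthly_index(d["@timestamp"]) for d in docs}:
        ensure_index(name)
    actions = (
        {
            "_op_type": "index",                       # upsert semantics on the deterministic _id
            "_index": monthly_index(d["@timestamp"]),  # index chosen by the client, not by rollover
            "_id": d["unique_key"],                    # hypothetical natural key of the document
            "_source": d,
        }
        for d in docs
    )
    return bulk(es, actions)
```

Because both the index name and the _id are fully determined by the document itself, launching the collector twice just rewrites the same (index, _id) pairs.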
If duplication of data is a problem, you may need to use standard time-based indices instead of rollover, so the timestamp can be used to directly identify the index. See this old blog post for additional details. Some details are out of date, but the core problem and solutions remain largely the same. You can still use the split API to adjust the shard count if some indices get too large.
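For the split suggestion, something along these lines should work, assuming the 7.x-style Python client (body= argument) and made-up index names; the target's primary shard count has to be a multiple of the source's:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster address

# The source index must be blocked for writes before it can be split.
es.indices.put_settings(index="my-stream-2021-09", body={"index.blocks.write": True})

# Split into a new index with more primary shards (a multiple of the source count).
es.indices.split(
    index="my-stream-2021-09",
    target="my-stream-2021-09-split",
    body={"settings": {"index.number_of_shards": 4}},
)
```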