Prevent duplicates in a data stream

Hello,
Is there a way to prevent duplicates in a data stream?
For a regular index, specifying the _id guarantees that there will be no duplicates with the same _id.
For data streams, however, this apparently does not work.
We have a (homemade) data collector that was launched twice. Because our data stream rolled over, the same data was inserted into two different backing indices, so it is present twice in our data stream, which is a huge problem for us.
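To illustrate (a rough sketch with the 8.x Python client; the index names, _id scheme, and document are made up):

```python
# Rough sketch: deterministic _id writes are idempotent on a regular index,
# but not across the backing indices of a data stream.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {"@timestamp": "2021-09-02T10:00:00Z", "value": 42}

# Regular index: the second call overwrites the first document, no duplicate.
es.index(index="my-index", id="collector-42", document=doc)
es.index(index="my-index", id="collector-42", document=doc)

# Data stream: writes require op_type=create and always go to the current
# write (backing) index, so after a rollover the same _id can end up in two
# different backing indices -- _id uniqueness is per index, not per stream.
es.create(index="my-data-stream", id="collector-42", document=doc)
```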

Are you aware of any solution to this problem?

I don't believe this is possible due to the way data streams work with rollover, as you point out.

Deduplication made (almost) easy, thanks to Elasticsearch's Aggregations - Spoons Elastic might be a way to clean up existing data, but I can't help with prevention, sorry.
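Roughly, the idea is to aggregate on a field that uniquely identifies an event and then delete the extra copies. A minimal sketch with the 8.x Python client, assuming a hypothetical `fingerprint` field (this is the general technique, not necessarily exactly what the linked post does):

```python
# Rough sketch of aggregation-based duplicate cleanup.
# Assumes a hypothetical "fingerprint" field that uniquely identifies an event.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="my-data-stream",
    size=0,
    aggs={
        "dupes": {
            "terms": {"field": "fingerprint", "min_doc_count": 2, "size": 1000},
            "aggs": {"docs": {"top_hits": {"size": 10, "_source": False}}},
        }
    },
)

for bucket in resp["aggregations"]["dupes"]["buckets"]:
    # Keep the first copy, delete the rest (each hit carries its backing index and _id).
    for hit in bucket["docs"]["hits"]["hits"][1:]:
        es.delete(index=hit["_index"], id=hit["_id"])
```

At billions of documents you would page through the keys with a composite aggregation rather than a single terms aggregation.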

Thank you for your response and the reading you proposed.

So it seems we have two solutions here:

  • Deduplicate the data as in the link provided. However, in our use case we have billions of documents, so I am not sure about performance in this case. But maybe that is the "big data way" of doing things, I don't know.

  • "pre-allocate document to correct index" Implement a similar process of data stream on our side. We define a pattern naming convention like "my-stream-2021-09-02". For each time window that we define (say 30 days), we create a new index from the client side, following the convention.
    Now each time we want to bulk new documents, we take min and max @timestamp of this bulk, we create corresponding indices if they do not exist. Finally, we bulk data and for each one extract from its timestamp the unique corresponding indice to insert to. This way, we have the guarantee that we won't have duplicates.
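A minimal sketch of that second option with the 8.x Python client, assuming a hypothetical "my-stream-YYYY-MM" convention and that the collector can compute a deterministic event_id:

```python
# Rough sketch: route each document to a time-based index derived from its
# @timestamp, with a deterministic _id, so re-running the collector is idempotent.
from datetime import datetime
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")


def index_for(ts: datetime) -> str:
    # One index per calendar month (stand-in for the 30-day window).
    return f"my-stream-{ts:%Y-%m}"


def to_action(event: dict) -> dict:
    ts = datetime.fromisoformat(event["@timestamp"])
    return {
        "_op_type": "index",       # overwrites if this _id was already indexed
        "_index": index_for(ts),   # same timestamp always maps to the same index
        "_id": event["event_id"],  # deterministic id computed by the collector
        "_source": event,
    }


events = [
    {"event_id": "sensor-1-1630576800", "@timestamp": "2021-09-02T10:00:00+00:00", "value": 42},
]

# Create any missing target indices up front (an index template could also handle this).
for name in {index_for(datetime.fromisoformat(e["@timestamp"])) for e in events}:
    if not es.indices.exists(index=name):
        es.indices.create(index=name)

helpers.bulk(es, (to_action(e) for e in events))
```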

Yep, unfortunately there's no ideal solution for this at this point.

If duplication of data is a problem, you may need to use standard time-based indices instead of rollover, so the timestamp can be used to directly identify the index. See this old blog post for additional details; some details are out of date, but the core problem and solutions remain largely the same. You can still use the split API to adjust the shard count if some indices get too large.
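For completeness, a rough sketch of the split step with the Python client (index name and shard count are just examples, and parameter names can differ between client versions):

```python
# Rough sketch: split an oversized monthly index into one with more primary shards.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The split API requires the source index to be blocked for writes first.
es.indices.put_settings(index="my-stream-2021-09", settings={"index.blocks.write": True})

# The target shard count must be a multiple of the source's primary shard count.
es.indices.split(
    index="my-stream-2021-09",
    target="my-stream-2021-09-split",
    settings={"index.number_of_shards": 4},
)
```

Afterwards you would typically point your searches or alias at the new index and lift the write block on whichever index still receives writes.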


Yep, that is indeed the second solution I proposed; we are going to apply it, thank you.
