Hello,
Is there a way to prevent duplicates in a data stream?
For a regular index, specifying the _id guarantees that there will be no duplicate documents with the same _id.
For data streams, however, this apparently does not work.
We have a (homemade) data collector that was launched twice. Because our data stream had rolled over in between, the same data was inserted into two different backing indices, so it is now present twice in our data stream, which is a huge problem for us.
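For context, here is a minimal sketch of the behaviour described above, assuming the Python elasticsearch client (7.x-style body= argument) and a hypothetical local cluster; the index names and _id scheme are made up for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster address

doc = {"@timestamp": "2021-09-02T10:00:00Z", "value": 42}

# On a regular index, indexing the same _id twice simply overwrites the first
# copy (the second call bumps _version), so no duplicate can exist.
es.index(index="my-index", id="collector-1:2021-09-02T10:00:00Z", body=doc)
es.index(index="my-index", id="collector-1:2021-09-02T10:00:00Z", body=doc)

# A data stream only accepts op_type=create, and (as far as I understand) the _id
# is only checked against the current write backing index. After a rollover the
# same _id can therefore be accepted again in a newer backing index, which is the
# duplication we are seeing.
es.index(index="my-data-stream", id="collector-1:2021-09-02T10:00:00Z",
         op_type="create", body=doc)
```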
Thank you for your response and the reading you suggested.
So it seems we have two solutions here:
Deduplicate the data as described in the link provided. However, in our use case we have billions of documents, so I am not sure about performance. But maybe that is the "big data way" of doing things, I don't know.
"pre-allocate document to correct index" Implement a similar process of data stream on our side. We define a pattern naming convention like "my-stream-2021-09-02". For each time window that we define (say 30 days), we create a new index from the client side, following the convention.
Now each time we want to bulk new documents, we take min and max @timestamp of this bulk, we create corresponding indices if they do not exist. Finally, we bulk data and for each one extract from its timestamp the unique corresponding indice to insert to. This way, we have the guarantee that we won't have duplicates.
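A rough sketch of that client-side routing, again assuming the Python elasticsearch client and a hypothetical cluster; monthly_index() and the unique_key field are made-up names for whichever naming convention and natural key are chosen:

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster address

def monthly_index(ts: str) -> str:
    """Map a document @timestamp to its fixed time-based index, e.g. my-stream-2021-09."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
    return f"my-stream-{dt:%Y-%m}"

def ensure_index(name: str) -> None:
    """Create the target index up front (mappings would normally come from an index template)."""
    if not es.indices.exists(index=name):
        es.indices.create(index=name)

def bulk_without_duplicates(docs):
    """Route each document to the index derived from its own timestamp and use a
    deterministic _id, so re-running the collector overwrites instead of duplicating."""
    for name in {monthly_index(d["@timestamp"]) for d in docs}:
        ensure_index(name)
    actions = (
        {
            "_op_type": "index",                       # upsert semantics on the deterministic _id
            "_index": monthly_index(d["@timestamp"]),  # index chosen by the client, not by rollover
            "_id": d["unique_key"],                    # hypothetical natural key of the document
            "_source": d,
        }
        for d in docs
    )
    return bulk(es, actions)
```

Because both the index name and the _id are fully determined by the document itself, launching the collector twice just rewrites the same (index, _id) pairs.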
If duplication of data is a problem, you may need to use standard time-based indices instead of rollover, so the timestamp can be used to directly identify the index. See this old blog post for additional details. Some details are out of date, but the core problem and solutions remain largely the same. You can still use the split API to adjust the shard count if some indices get too large.
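For the split suggestion, something along these lines should work, assuming the 7.x-style Python client (body= argument) and made-up index names; the target's primary shard count has to be a multiple of the source's:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster address

# The source index must be blocked for writes before it can be split.
es.indices.put_settings(index="my-stream-2021-09", body={"index.blocks.write": True})

# Split into a new index with more primary shards (a multiple of the source count).
es.indices.split(
    index="my-stream-2021-09",
    target="my-stream-2021-09-split",
    body={"settings": {"index.number_of_shards": 4}},
)
```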