What is the fastest way to prevent duplicates during indexing time?
(I'm using Elasticsearch 5.2.)
The chances of a duplicate ID with this sort of data are pretty slim.
This depends on the use case. If your data has a natural ID, or one can be derived, you can use it as the document ID so that inserting the same event into the same index twice results in an update instead of a second document. If it is hard to derive a unique ID for your events, you can instead compute a hash of the event content and use that as the ID, e.g. with the fingerprint plugin. This does not strictly rule out hash collisions, but with a suitable hash function they should be extremely rare.
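To make the content-hash idea concrete, here is a minimal Python sketch (the function name and sample event are illustrative; in a Logstash pipeline the fingerprint plugin performs this step for you):

```python
import hashlib
import json

def event_fingerprint(event: dict) -> str:
    """Hash the event content to get a deterministic document ID."""
    # Serialize with sorted keys so logically equal events hash identically.
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

event = {"host": "web-1", "message": "user logged in", "status": 200}
doc_id = event_fingerprint(event)

# Indexing with this explicit ID means a duplicate of the same event
# overwrites the existing document instead of creating a new one, e.g.:
#   PUT my-index/my-type/<doc_id>
print(doc_id)
```

Because the ID is derived purely from the content, re-sending the same event always targets the same document, turning the duplicate insert into an update.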
I created an experimental plugin called hashid to generate IDs based on hashed event data. It prefixes each ID with a timestamp, which makes the IDs largely sequential; this is more Lucene-friendly and should give better indexing performance.
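The timestamp-prefix idea can be sketched as follows (this is an illustration of the concept only, not the hashid plugin's actual ID encoding; all names here are hypothetical):

```python
import hashlib
import json
from datetime import datetime, timezone

def time_prefixed_id(event: dict, timestamp: datetime) -> str:
    """Build an ID with a leading timestamp and a content-hash suffix."""
    # Zero-padded millisecond epoch, so IDs sort lexicographically by time.
    millis = int(timestamp.timestamp() * 1000)
    prefix = format(millis, "013d")
    # Shortened content hash keeps the ID deterministic per event.
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"{prefix}-{digest}"

ts = datetime(2017, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
print(time_prefixed_id({"message": "login"}, ts))
```

Mostly-monotonic IDs like these are cheaper for Lucene to check for existing versions at index time than uniformly random hashes, which is where the performance benefit comes from.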
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.