If you want to do deduplication you need to be able to define equality
between two documents. Ideally your documents should contain a unique
key to compare on. If there is none, you could compute such a unique
key using, for example, some hashing over your content (md5/sha1 over
Depending on the rate at which you insert new documents, you will most
probably want to do your deduplication before the indexing phase in ES.
For this you can keep the list of inserted unique IDs in memory, in a
bloom filter for example. At this point it really depends on your
architecture/use-case but there are many alternatives to avoid hitting
ES for a duplicate check before indexing every document.
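
To make the idea above concrete, here is a minimal Python sketch, not a definitive implementation. It assumes documents are JSON-serializable dicts and that hashing the whole canonicalized content is an acceptable definition of equality. The names content_key, BloomFilter and should_index are hypothetical helpers invented for illustration; nothing here is an ES API.

```python
import hashlib
import json


def content_key(doc: dict) -> str:
    """Derive a stable key from the content (SHA-1 over canonical JSON).

    Assumption: two docs are "equal" when all their fields are equal.
    """
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class BloomFilter:
    """Tiny in-memory Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def should_index(doc: dict, seen: BloomFilter) -> bool:
    """Return True the first time a given content is seen, False afterwards."""
    key = content_key(doc)
    if key in seen:
        return False  # probably a duplicate (false positives are possible)
    seen.add(key)
    return True
```

The indexing app would call should_index(doc, seen) on each incoming document and only send it to ES when the result is True. Because a Bloom filter can report false positives, a small fraction of genuinely new documents may be skipped; if that matters, the rare "seen" answers can be confirmed with a lookup in ES, which still avoids a query for every document.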
If you want better suggestions, give us a little more context on your
ES setup and use case.
On Wed, Jun 8, 2011 at 7:52 PM, Otis email@example.com wrote:
What's the best way to deal with deduplication of documents in the
cluster? Imagine the same document comes down some pipe to the app
that indexes docs into ES. Imagine there is no unique and persistent
document ID one could use to quickly query the cluster to see if such
a doc already exists. How would one detect that the same doc already
exists in order to avoid indexing this document twice?