What's the best way to deal with deduplication of documents in the
cluster? Imagine the same document comes down some pipe to the app
that indexes docs into ES, and there is no unique and persistent
document ID one could use to quickly query the cluster to see whether
such a doc already exists. How would one detect that the same doc
already exists, in order to avoid indexing it twice?
I think at that point, you would only be able to query for the existence of
a previously indexed document using the other fields that together
uniquely identify it. It might well be quicker to simply index the
document again.
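A minimal sketch of such an existence check with the Python client; the index name and the fields (title, author) are assumptions, and the term queries on "title.keyword" etc. assume the default dynamic mapping with keyword subfields:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def already_indexed(doc: dict, index: str = "docs") -> bool:
    query = {
        "bool": {
            "filter": [
                {"term": {"title.keyword": doc["title"]}},
                {"term": {"author.keyword": doc["author"]}},
            ]
        }
    }
    # Caveat: documents indexed since the last refresh are not yet
    # visible to this query, which is the catch discussed below.
    return es.count(index=index, query=query)["count"] > 0

def index_if_new(doc: dict, index: str = "docs") -> None:
    if not already_indexed(doc, index):
        es.index(index=index, document=doc)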
Deduplication is a complicated task. But the main idea is to query for
existing documents with 'more like this' or 'fuzzy like this'. You can
take a look here:
Or at jetwick, where I implemented this for tweets (via my own 'more
like this' and a Jaccard index check afterwards):
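As a rough illustration of that Jaccard check: compare the token sets of the new document and each retrieved candidate, and treat anything above some threshold as a duplicate. The 0.8 threshold below is an assumption, not the value used in jetwick:

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two documents' token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def is_duplicate(new_doc: str, candidates: list[str], threshold: float = 0.8) -> bool:
    # candidates would come from a 'more like this' style query
    return any(jaccard(new_doc, c) >= threshold for c in candidates)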
This has a large impact on indexing performance. At first I
implemented it with a refresh call after bulk indexing, but I have
since changed it to avoid that call with the help of the versioning
feature.
But if you want faster duplicate queries (which is especially
difficult for large documents), you'll need to use locality-sensitive
hashing techniques:
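An illustrative MinHash signature (one locality-sensitive hashing scheme) for cheap near-duplicate detection on large documents; the number of hash functions here is arbitrary, not a tuned value:

import hashlib

NUM_HASHES = 64

def minhash_signature(text: str) -> tuple:
    """MinHash signature over the document's token set."""
    tokens = set(text.lower().split()) or {""}  # guard against empty docs
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for seed in range(NUM_HASHES)
    )

def estimated_jaccard(sig_a: tuple, sig_b: tuple) -> float:
    """The fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

In practice the signature is usually split into bands that get indexed as keyword fields, so candidate duplicates can be fetched with exact term queries instead of comparing the new document against every stored one.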
If you want to do deduplication you need to be able to define equality
between two documents. Ideally your documents should contain a unique
key to compare on. If there is no such key, you could compute one by
hashing your content (for example, md5/sha1 over some field(s)).
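One way to wire that up, sketched below with the Python client: hash the identifying fields into a stable _id and index with op_type="create", so a second attempt with the same content fails with a conflict instead of creating a duplicate. No search (and hence no refresh) is involved. The index name and field choice are hypothetical:

import hashlib
from elasticsearch import Elasticsearch, ConflictError

es = Elasticsearch("http://localhost:9200")

def content_id(doc: dict) -> str:
    """Stable ID derived from the fields that define document equality."""
    h = hashlib.sha1()
    for field in ("title", "body"):  # hypothetical identifying fields
        h.update(doc.get(field, "").encode("utf-8"))
    return h.hexdigest()

def index_once(doc: dict, index: str = "docs") -> None:
    try:
        es.index(index=index, id=content_id(doc), document=doc,
                 op_type="create")
    except ConflictError:
        pass  # same content already indexed; skip it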
Depending on the rate at which you insert new documents, you will most
probably want to do your deduplication before the indexing phase in ES.
For this you can keep the set of inserted unique IDs in memory, in a
Bloom filter for example. At this point it really depends on your
architecture/use case, but there are many alternatives to avoid hitting
ES for a duplicate check before indexing every document.
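A minimal in-memory Bloom filter along those lines; the sizes are illustrative, not tuned. Remember that a Bloom filter can report false positives, so a hit may still warrant a confirming query against ES, while a miss means the document is definitely new:

import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions from seeded SHA-1 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

seen = BloomFilter()

def should_index(doc_key: str) -> bool:
    if doc_key in seen:
        return False  # probably a duplicate (may be a false positive)
    seen.add(doc_key)
    return True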
If you want better suggestions, give us a little more context on your
ES setup and use case.