Indexing-time document deduplication

otisg · June 8, 2011, 11:52pm

Hello,

What's the best way to deal with deduplication of documents in the
cluster? Imagine the same document comes down some pipe to the app
that indexes docs into ES. Imagine there is no unique and persistent
document ID one could use to quickly query the cluster to see if such
a doc already exists. How would one detect that the same doc already
exists in order to avoid indexing this document twice?

Thanks,
Otis

James_Cook · June 9, 2011, 1:07am

I think at that point, you would only be able to query for the existence of
a prior indexed document using the other fields in the document which would
uniquely identify the document. It might be quicker to index the document
again.


Karussell1 · June 9, 2011, 11:47am

Otis,

Deduplication is a complicated task, but the main idea is to query for
existing documents with 'more like this' or 'fuzzy like this'. You can
take a look here:

http://zmievski.org/2011/03/duplicates-detection-with-elasticsearch

Or at Jetwick, where I implemented this for tweets (via my own 'more
like this' and afterwards a Jaccard index check):

https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jetwick/es/ElasticTweetSearch.java#L971


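    // Excerpt of the linked file as shown in the forum preview. The preview
    // is cut off mid-method; the closing lines marked below are an assumed
    // completion, not copied from the repository.
    public Collection<String> searchTrends(JetwickQuery q, int limit) {
        try {
            q.addFacetField(TAG);
            SearchResponse rsp = query(q);
            Facets facets = rsp.facets();
            if (facets == null)
                return Collections.emptyList();

            // Collect the tag terms whose facet count exceeds the limit.
            Set<String> set = new LinkedHashSet<String>();
            for (Facet facet : facets.facets()) {
                if (facet instanceof TermsFacet) {
                    TermsFacet ff = (TermsFacet) facet;
                    for (TermsFacet.Entry e : ff.entries()) {
                        if (e.count() > limit)
                            set.add(e.getTerm());
                    }
                }
            }
            return set; // assumed completion
        } catch (Exception ex) { // assumed completion
            throw new RuntimeException(ex);
        }
    }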

This has a large impact on indexing performance. At first I
implemented this with a refresh call after bulk indexing, but I have
since changed it to avoid that call with the help of the versioning
feature.
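
To make the Jaccard index check mentioned above concrete, here is a minimal sketch in Java. It is not the Jetwick code: the class name `JaccardCheck`, the tokenizer, and the threshold passed to `isDuplicate` are all assumptions made for illustration. The idea is that candidates returned by a 'more like this' query are compared to the incoming document, and anything above the threshold is treated as a duplicate.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class JaccardCheck {
    /** Splits text into a set of lower-cased word tokens (a deliberately crude tokenizer). */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    /** Jaccard index: |A intersect B| / |A union B|, defined here as 1.0 for two empty sets. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /** Assumption: the caller picks a threshold that suits its own data. */
    static boolean isDuplicate(String candidate, String existing, double threshold) {
        return jaccard(tokens(candidate), tokens(existing)) >= threshold;
    }
}
```

The check only runs against the handful of candidates the search returns, so its cost is small compared to the query itself.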

But if you want faster duplicate queries (which is especially
difficult for large documents) you'll need to use a locality-sensitive
hashing technique, which requires extra space (and memory).
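
As one illustration of locality-sensitive hashing, the sketch below computes a 64-bit SimHash fingerprint in Java. SimHash is my choice of example here, not necessarily the technique Peter had in mind, and the class name, the tokenizer, and the FNV-1a token hash are assumptions. Near-identical texts tend to produce fingerprints that differ in only a few bits, so a duplicate check becomes a Hamming distance comparison on a stored `long` per document, which is the extra space mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class SimHash {
    /** Computes a 64-bit SimHash over the word tokens of the text. */
    static long fingerprint(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            long h = fnv1a64(token);
            // Each token votes +1 or -1 on every bit position of the fingerprint.
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long result = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                result |= 1L << bit;
            }
        }
        return result;
    }

    /** Number of bit positions in which two fingerprints differ. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** 64-bit FNV-1a hash of the token's UTF-8 bytes. */
    private static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

How many differing bits still count as "the same document" is a tuning decision that depends on document length and on how noisy the input is.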

Regards,
Peter.

Karussell1 · June 9, 2011, 7:15pm

Or did you mean the dedup of Solr

http://wiki.apache.org/solr/Deduplication

?

It should work to grab the sources and use this for ES (as Solr uses
a text signature algorithm, TextProfileSignature, borrowed from Nutch,
if I remember correctly).

colinsurprenant · June 9, 2011, 8:09pm

If you want to do deduplication you need to be able to define equality
between two documents. Ideally your documents should contain a unique
key to compare on. If there are none you could compute such a unique
key using, for example, some hashing over your content (md5/sha1 over
some field(s)).
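
A minimal sketch of such a computed key in Java, assuming that equality means "the chosen fields match after trimming and lower-casing". The class name `ContentHashId` and the normalization rule are illustrative assumptions; only `java.security.MessageDigest` from the standard library is used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

class ContentHashId {
    /** Returns a 40-character hex SHA-1 digest over the given field values. */
    static String idFor(String... fields) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            for (String field : fields) {
                // Assumption: trimming and lower-casing is an acceptable notion of equality.
                String normalized = field == null ? "" : field.trim().toLowerCase(Locale.ROOT);
                sha1.update(normalized.getBytes(StandardCharsets.UTF_8));
                // Separator byte so that ("ab", "c") and ("a", "bc") hash differently.
                sha1.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

If the result is used as the document ID when indexing, a repeated document overwrites the earlier copy instead of creating a second one; indexing with `op_type=create` should instead reject the repeat.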

Depending on the rate at which you insert new documents, you will most
probably want to do your deduplication before the indexing phase in ES.
For this you can keep the list of inserted unique IDs in memory, in a
bloom filter for example. At this point it really depends on your
architecture/use-case but there are many alternatives to avoid hitting
ES for a duplicate check before indexing every document.
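
The in-memory bloom filter idea could look roughly like the Java sketch below. The class name `SeenIdFilter`, the sizing parameters, and the FNV-1a based double hashing are assumptions for illustration; a production setup would more likely take a tested library implementation and size it from the expected number of documents.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

class SeenIdFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SeenIdFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /**
     * Records the ID. Returns false if the ID was definitely not seen before,
     * true if it was possibly seen (false positives can happen).
     */
    boolean checkAndAdd(String id) {
        long h = fnv1a64(id);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1; // odd step for double hashing
        boolean possiblySeen = true;
        for (int i = 0; i < hashCount; i++) {
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                possiblySeen = false;
                bits.set(index);
            }
        }
        return possiblySeen;
    }

    /** 64-bit FNV-1a hash of the ID's UTF-8 bytes. */
    private static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

A `false` result means the document can be indexed without asking ES. A `true` result only means "maybe", so that is the case where a real lookup (or simply skipping the document, if occasional loss is acceptable) comes in.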

If you want better suggestions, give us a little more context on your
ES setup and use case.

Colin

