[RFC] idea for a near duplicate filter

Valentin_Pletzer · July 28, 2014, 9:12pm

I have an idea for a filter with a technique I used on another project. I thought I should share because this might be useful to someone.

Finding exact matches is an easy task but finding documents with small differences isnt. Google is using a technique which is very easy to implement and only costs 64 bit (if you choose this size). It is called simhash and is not to be confused with minhash (which Twitter is using).

In simple words simhash works like this:

you hash every token with a classic hash (like Murmur)
you sum up every first bit, second bit and so on (1 = +1 and 0 = -1)
now you have 64 numbers and for every positive number the corresponding simhash bit is 1 and for every zero and negative number the simhash bit is 0

Thats it. Now if you searching for a near duplicate document you only have to compare the simhashes and if it only differs in a small number of bits (the Google paper says a hamming distance of 3 in 64) its a near duplicate.

And there is more. There are things you could do, like giving stop words only a small weight when adding up. Giving synonyms the same (murmur) hash with a smaller weight

Links:
http://www.wwwconference.org/www2007/papers/paper215.pdf

http://www.matpalm.com/resemblance/simhash/

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/31fe6560-d066-481d-8b7a-7f51cab735a2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

vineeth_mohan_2 · July 29, 2014, 2:50am

Hello Valentin ,

Thanks for this suggestion.
We might find this useful in some of our projects.

Thanks
Vineeth

On Tue, Jul 29, 2014 at 2:42 AM, Valentin pletzer@gmail.com wrote:

I have an idea for a filter with a technique I used on another project. I
thought I should share because this might be useful to someone.

Finding exact matches is an easy task but finding documents with small
differences isnt. Google is using a technique which is very easy to
implement and only costs 64 bit (if you choose this size). It is called
simhash and is not to be confused with minhash (which Twitter is using).

In simple words simhash works like this:

you hash every token with a classic hash (like Murmur)

you sum up every first bit, second bit and so on (1 = +1 and 0 = -1)

now you have 64 numbers and for every positive number the corresponding
simhash bit is 1 and for every zero and negative number the simhash bit is 0

Thats it. Now if you searching for a near duplicate document you only have
to compare the simhashes and if it only differs in a small number of bits
(the Google paper says a hamming distance of 3 in 64) its a near duplicate.

And there is more. There are things you could do, like giving stop words
only a small weight when adding up. Giving synonyms the same (murmur) hash
with a smaller weight

Links:
http://www.wwwconference.org/www2007/papers/paper215.pdf

http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
simhash

GitHub - twitter/algebird: Abstract Algebra for Scala

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/31fe6560-d066-481d-8b7a-7f51cab735a2%40googlegroups.com
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAGdPd5%3Doqym%3D2y24rijgWekk2z5%2BZCYX%2BKg3UcCfet3-Lh%2BRjg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Topic		Replies	Views
Near duplicate document detection Elasticsearch	2	1385	August 12, 2020
Finding documents _almost_ the same Elasticsearch	5	2752	December 13, 2016
Duplicate documents detection in Elasticsearch Elasticsearch	4	2805	July 5, 2017
Native approach to search similar documents using Minhash token filter Elasticsearch	1	504	April 28, 2020
Near duplicate detection using MinHash and approximated Jaccard score Elasticsearch	1	1361	April 11, 2019

[RFC] idea for a near duplicate filter

Related topics