Native approach to search similar documents using Minhash token filter

Or_Chen · March 31, 2020, 4:16pm

I want to implement MinHash-LSH, an algorithm for detecting near-duplicates, completely within Elasticsearch.
The process for each indexed document is as follows:

1. Tokenize the text into shingles (Shingle filter)
2. Turn the tokens into minhashes (MinHash token filter)
3. Split the minhash signature into bands (???)
4. Hash each band together with its band number (???)
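
To make the four steps concrete, here is a rough sketch of them in plain Python, outside Elasticsearch. It is only an illustration under my own assumptions: the function names (`shingles`, `minhash_signature`, `band_keys`), the word-level shingle size, the seeded SHA-1 hash family, and the 16 hashes / 4 bands split are all made up for the example and are not what the Elasticsearch filters do internally.

```python
import hashlib


def _hash(value: str, seed: int) -> int:
    # Hypothetical hash family: SHA-1 of "seed:value", truncated to 64 bits.
    digest = hashlib.sha1(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shingles(text: str, k: int = 3) -> set:
    # Step 1: word-level shingles of k consecutive tokens.
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set: set, num_hashes: int = 16) -> list:
    # Step 2: for each seeded hash function, keep the minimum hash value.
    if not shingle_set:
        raise ValueError("cannot build a signature from an empty shingle set")
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def band_keys(signature: list, bands: int = 4) -> list:
    # Steps 3 and 4: split the signature into equal bands, then hash each
    # band together with its band number so equal rows in different bands
    # never collide.
    if len(signature) % bands != 0:
        raise ValueError("signature length must be divisible by the number of bands")
    rows = len(signature) // bands
    keys = []
    for b in range(bands):
        chunk = signature[b * rows:(b + 1) * rows]
        payload = f"{b}:" + ",".join(str(v) for v in chunk)
        keys.append(hashlib.sha1(payload.encode("utf-8")).hexdigest()[:16])
    return keys
```

Each document would then be indexed under its list of band keys, and two documents sharing at least one key become duplicate candidates.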

Then, for each search on the field, the analyzer could run the same process at search time to find any documents that are candidates for being duplicates.
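
The lookup I have in mind behaves roughly like the toy in-memory index below. This is a sketch, not Elasticsearch code: `BandIndex` and its methods are hypothetical names, and the band keys are assumed to be precomputed strings (one per band) produced by whatever does the banding.

```python
from collections import defaultdict


class BandIndex:
    """Toy inverted index from band key to the documents that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, keys):
        # Index time: register the document under every one of its band keys.
        for key in keys:
            self._postings[key].add(doc_id)

    def candidates(self, keys):
        # Search time: any document sharing at least one band key is a candidate.
        found = set()
        for key in keys:
            found |= self._postings.get(key, set())
        return found


index = BandIndex()
index.add("doc-1", ["b0-aaa", "b1-bbb"])
index.add("doc-2", ["b0-ccc", "b1-ddd"])
matches = index.candidates(["b0-aaa", "b1-zzz"])
```

In Elasticsearch terms, this would correspond to an OR query over the band-key terms produced by the search analyzer, with the candidates checked afterwards against a real similarity threshold.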

Does anyone know which filters can be used for steps 3 and 4? Maybe the fingerprint filter for step 4, but I don't know how to do step 3.
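
For reference, steps 1 and 2 could look roughly like the index settings below, written as a Python dict that would be sent as the JSON body when creating the index. It uses the built-in `shingle` and `min_hash` token filters; the filter and analyzer names (`my_shingle`, `my_minhash`, `minhash_analyzer`) and all parameter values are placeholders I picked for the example, and nothing here attempts the banding of steps 3 and 4.

```python
# Sketch of index settings covering only shingling and minhashing.
# Names and parameter values are illustrative, not tuned.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_shingle": {
                    "type": "shingle",
                    "min_shingle_size": 5,
                    "max_shingle_size": 5,
                    "output_unigrams": False,  # emit shingles only
                },
                "my_minhash": {
                    "type": "min_hash",
                    "hash_count": 1,
                    "bucket_count": 512,
                    "hash_set_size": 1,
                    "with_rotation": True,
                },
            },
            "analyzer": {
                "minhash_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_shingle", "my_minhash"],
                }
            },
        }
    }
}
```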

If not, has anyone implemented this in some other way?

Thanks for your support

