I want to implement Minhash-LSH completly on ElasticSearch. It is an algorithm for detecting near-duplicates.
The process for each indexed document is as follows:
- Tokenize the text into shingles (Shingle Filter)
- Turn the tokens into minhashes (Minhash token)
- Split the minhash into bands (???)
- hash each band with the band number (???)
then, for each search on the field, I could let the analyzer do the process at search time and see if there are any Documents that are candidates for being duplicates
Does anyone knows which filters can be used for steps 3 and 4? Maybe fingerprint for step 4 but I dont know how to do step 3.
If not, has someone implemented this in some other way?
Thanks for your support