Near duplicate detection using MinHash and approximated Jaccard score

woutermostard · March 14, 2019, 9:09am

Hi all,

I am trying to find near-duplicates among large documents. In a custom analyzer, I first split the text with the whitespace tokenizer and then apply a min_hash token filter. The analyzer is then assigned to a text field named "title".

{
  "settings": {
    "analysis": {
      "filter": {
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,
          "bucket_count": 128,
          "hash_set_size": 1,
          "with_rotation": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  }
}
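
For readers unfamiliar with these settings, here is a rough Python sketch of what a single-hash, bucketed MinHash with rotation computes. It is not Lucene's or Elasticsearch's implementation: the hash function (BLAKE2b), the modulo bucket assignment, and the name bucketed_minhash are assumptions chosen for illustration. The rotation step follows the documented behaviour of with_rotation, which fills an empty bucket with the value of the first non-empty bucket to its circular right.

```python
import hashlib


def bucketed_minhash(tokens, bucket_count=128, with_rotation=True):
    """Illustrative single-hash MinHash: keep the smallest hash seen per bucket.

    Assumption: buckets are assigned by `hash % bucket_count`; the real
    filter may partition the hash space differently.
    """
    buckets = [None] * bucket_count
    for token in set(tokens):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        idx = h % bucket_count
        if buckets[idx] is None or h < buckets[idx]:
            buckets[idx] = h

    # Rotation: fill each empty bucket from the first non-empty bucket
    # found by walking to the right, wrapping around at the end.
    if with_rotation and any(b is not None for b in buckets):
        filled = list(buckets)
        for i in range(bucket_count):
            if filled[i] is None:
                j = (i + 1) % bucket_count
                while buckets[j] is None:
                    j = (j + 1) % bucket_count
                filled[i] = buckets[j]
        buckets = filled
    return buckets


signature = bucketed_minhash("Two LH Research SM11-1 power supplies".split())
print(len(signature))  # one value per bucket
```

With only a handful of tokens most buckets start out empty, so rotation is what guarantees a full-length signature for short texts.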

Then, using the Python client, I fill the index with some example data:

from elasticsearch import Elasticsearch
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups training split as example documents.
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Index each post into the "title" field, using its position as the document ID.
for index, example in enumerate(twenty_train['data']):
    es.create(index='test_analyzer', doc_type='_doc', id=index, body={"title": example})

To check whether this works, I tried to retrieve some near-duplicates with a more_like_this query:

{
  "query": {
    "more_like_this": {
      "fields": ["title"],
      "like": "From: steve@titan.tsd.arlut.utexas.edu (Steve Glicker)\nSubject: 2 1000W Power Supplies\nNntp-Posting-Host: rooster\nOrganization: Applied Research Labs, The University of Texas at Austin\nDistribution: misc\nLines: 14\n\nTwo LH Research SM11-1 power supplies (SM10 series).\n\n1000W, 5V, 200A (currently wired for 115VAC)\n\nControl lines: +/- sense, on/off, pwr.fail, high/low margin, and\ncurrent monitor.\n\n(The list price from LH Research is $824.00 each for qty. 1-9)\n\nAsking $500.00 for the pair.\n\nSteve Glicker\nAustin, Texas\n(steve@titan.tsd.arlut.utexas.edu)\n"
    }
  }
}

This works and returns the most similar documents, but it is not quite what I want. I would like to return only the documents whose approximate Jaccard similarity to the query is above some threshold, as is common with MinHash and LSH.
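
To make the goal concrete, a minimal pure-Python sketch of that kind of thresholding follows. It does not use Elasticsearch and is not how the min_hash filter works internally: it uses classic MinHash with one salted hash per signature position, and the names minhash_signature, estimated_jaccard, and near_duplicates, as well as the 0.8 threshold, are assumptions for illustration only. The estimate is the fraction of signature positions on which two documents agree.

```python
import hashlib


def minhash_signature(tokens, num_hashes=128):
    """For each salted hash function, keep the minimum hash over the token set."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("cannot build a signature for an empty document")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in token_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures hold the same value."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(query_text, corpus, threshold=0.8, num_hashes=128):
    """Return (doc_id, score) pairs scoring at or above the threshold, best first."""
    query_sig = minhash_signature(query_text.split(), num_hashes)
    hits = []
    for doc_id, text in corpus.items():
        score = estimated_jaccard(query_sig, minhash_signature(text.split(), num_hashes))
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


corpus = {
    "a": "Two LH Research SM11-1 power supplies",
    "b": "Asking for advice on hockey skates",
}
print(near_duplicates("Two LH Research SM11-1 power supplies", corpus))
```

What I am looking for is the equivalent cut-off inside Elasticsearch, so that documents below the threshold are never returned.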

I searched Google and the forums here, but a search for minhash returns only 5 forum results.

Thank you in advance.

Wouter

system · April 11, 2019, 9:09am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
