Near duplicate detection using MinHash and approximated Jaccard score

Hi all,

I am trying to find near-duplicates among large documents in Elasticsearch. First I split the text with a whitespace tokenizer and then apply a min_hash token filter in a custom analyzer. The analyzer is then mapped to a field named "title":

{
  "settings": {
    "analysis": {
      "filter": {
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,
          "bucket_count": 128,
          "hash_set_size": 1,
          "with_rotation": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  }
}
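
As a rough illustration of what I expect this analyzer to do conceptually, here is a minimal pure-Python sketch. It uses the textbook MinHash variant with one seeded hash per signature slot. That is an assumption made for illustration only and is not how the min_hash filter works internally (the filter above is configured with 1 hash function and 128 buckets). The helpers minhash_signature and estimated_jaccard are hypothetical names, not part of any library.

```python
import hashlib


def minhash_signature(text, num_hashes=128):
    """Return a list with one minimum hash value per seeded hash function."""
    tokens = set(text.split())  # whitespace tokenization, as in the analyzer
    if not tokens:
        raise ValueError("text contains no tokens")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(tok.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for tok in tokens
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree; this estimates the Jaccard score."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two documents with the same set of whitespace tokens get identical signatures, and the more tokens they share, the more signature slots agree on average.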

Then, using the Python API, I fill the index with some example data:

from elasticsearch import Elasticsearch
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Index every newsgroup post as one document, using its position as the id
for index, example in enumerate(twenty_train['data']):
    es.create(index='test_analyzer', doc_type='_doc', id=index, body={"title": example})

To check whether this works, I tried to retrieve some near-duplicates:

{
"query": {
    "more_like_this": {
        "fields": ["title"],
        "like": "From: steve@titan.tsd.arlut.utexas.edu (Steve Glicker)\nSubject: 2 1000W Power Supplies\nNntp-Posting-Host: rooster\nOrganization: Applied Research Labs, The University of Texas at Austin\nDistribution: misc\nLines: 14\n\nTwo LH Research SM11-1 power supplies (SM10 series).\n\n1000W, 5V, 200A (currently wired for 115VAC)\n\nControl lines: +/- sense, on/off, pwr.fail, high/low margin, and\ncurrent monitor.\n\n(The list price from LH Research is $824.00 each for qty. 1-9)\n\nAsking $500.00 for the pair.\n\nSteve Glicker\nAustin, Texas\n(steve@titan.tsd.arlut.utexas.edu)\n"
    }
}
}

This works and returns the most similar documents, but it is not quite what I want. I would like to return only the results whose approximated Jaccard score is above some threshold, as is common with MinHash and LSH.
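
To make the kind of filtering I mean concrete, here is a minimal client-side sketch. It is a workaround under stated assumptions, not something Elasticsearch does for me: it treats the more_like_this hits as candidates and re-scores them with the exact Jaccard score on whitespace tokens, instead of an approximated one. The function names and the default threshold of 0.8 are hypothetical, and hits is assumed to be the list found under ["hits"]["hits"] in the search response.

```python
def jaccard(text_a, text_b):
    """Exact Jaccard score between the whitespace-token sets of two texts."""
    a, b = set(text_a.split()), set(text_b.split())
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def filter_hits(query_text, hits, threshold=0.8, field="title"):
    """Keep (id, score) pairs for hits whose Jaccard score reaches the threshold."""
    kept = []
    for hit in hits:
        score = jaccard(query_text, hit["_source"][field])
        if score >= threshold:
            kept.append((hit["_id"], score))
    # Most similar documents first
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

With the client above, the input could come from es.search(index='test_analyzer', body=query)['hits']['hits'], where query is the more_like_this request shown earlier. I would still prefer to have this threshold applied inside the query itself.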

I searched Google and the forums here, but a search for "minhash" returns only 5 forum results.

Thank you in advance.

Wouter
