Hi all,
I am trying to find near-duplicates of large documents. First I split the text with a whitespace tokenizer and then apply a min_hash token filter in a custom analyzer. The analyzer is then mapped to a field named "title":
{
  "settings": {
    "analysis": {
      "filter": {
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,
          "bucket_count": 128,
          "hash_set_size": 1,
          "with_rotation": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  }
}
Then, using the Python API, I fill the index with some example data:
from elasticsearch import Elasticsearch
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups training set as example documents
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Index each post into the "title" field, using its position as the document id
for index, example in enumerate(twenty_train['data']):
    es.create(index='test_analyzer', doc_type='_doc', id=index, body={"title": example})
To check whether this works, I tried to retrieve some near-duplicates:
{
  "query": {
    "more_like_this": {
      "fields": ["title"],
      "like": "From: steve@titan.tsd.arlut.utexas.edu (Steve Glicker)\nSubject: 2 1000W Power Supplies\nNntp-Posting-Host: rooster\nOrganization: Applied Research Labs, The University of Texas at Austin\nDistribution: misc\nLines: 14\n\nTwo LH Research SM11-1 power supplies (SM10 series).\n\n1000W, 5V, 200A (currently wired for 115VAC)\n\nControl lines: +/- sense, on/off, pwr.fail, high/low margin, and\ncurrent monitor.\n\n(The list price from LH Research is $824.00 each for qty. 1-9)\n\nAsking $500.00 for the pair.\n\nSteve Glicker\nAustin, Texas\n(steve@titan.tsd.arlut.utexas.edu)\n"
    }
  }
}
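For reference, the same request body can be built from Python for any input text. This is only a small sketch: build_mlt_query is a hypothetical helper name of my own, and the es.search call in the comment assumes the same client and index as in the indexing snippet.

```python
def build_mlt_query(text, field="title"):
    """Build the more_like_this request body shown above for arbitrary text.

    `build_mlt_query` is a hypothetical helper, not part of the client library.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": text,
            }
        }
    }


body = build_mlt_query("Two LH Research SM11-1 power supplies (SM10 series).")

# With a running cluster, and assuming the `es` client and index from above,
# the body could then be sent with:
#   es.search(index="test_analyzer", body=body)
```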
This works and returns the most similar documents. However, it is not quite the result I want. I would like to return only the results whose estimated similarity (the Jaccard similarity that MinHash approximates) is above some threshold, as is common with MinHash and LSH.
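To make the kind of thresholding I mean concrete, here is a minimal pure-Python sketch outside Elasticsearch. It uses the classic MinHash scheme with one seeded hash function per signature slot, which is an assumption of mine and not necessarily how the min_hash token filter lays out its buckets. The function names and the 0.8 threshold are illustrative only.

```python
import hashlib


def tokens_of(text):
    """Whitespace-split the text into a set of tokens."""
    return set(text.split())


def minhash_signature(tokens, num_hashes=128):
    """Classic MinHash: keep the minimum hash value per seeded hash function.

    Assumes `tokens` is non-empty.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(tok.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for tok in tokens
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def near_duplicates(query, docs, threshold=0.8):
    """Return (doc_id, score) pairs whose estimated similarity reaches the threshold."""
    query_sig = minhash_signature(tokens_of(query))
    hits = []
    for doc_id, text in docs.items():
        score = estimated_jaccard(query_sig, minhash_signature(tokens_of(text)))
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

What I am looking for is the equivalent of the threshold argument here, but applied inside the query against the hashed "title" field.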
I tried Googling and searched the forums here, but a search for "minhash" returns only 5 forum results.
Thank you in advance.
Wouter