I am working on near-duplicate detection for short texts (social media posts, reviews, snippetized articles, etc.), and I have been trying to configure the "min_hash" token filter per the documentation.
I set the parameters similarly to the documentation's example, but reduced "bucket_count" to 64 since the texts are fairly short. I have Elasticsearch 7.16 running locally in a Docker container, which I connect to by IP from the Python client.
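To be concrete, here is a sketch of the settings I mean (field and analyzer names are simplified placeholders, not my real ones); it follows the shingle + min_hash pattern from the docs, with "bucket_count" lowered to 64:

```python
# Sketch of the index settings (names simplified). Based on the min_hash
# example in the Elasticsearch docs, with bucket_count reduced to 64
# because the texts are short.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type": "shingle",
                    "min_shingle_size": 5,
                    "max_shingle_size": 5,
                    "output_unigrams": False,
                },
                "my_minhash_filter": {
                    "type": "min_hash",
                    "hash_count": 1,
                    "bucket_count": 64,  # reduced from the documented 512
                    "hash_set_size": 1,
                    "with_rotation": True,
                },
            },
            "analyzer": {
                "my_minhash_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["my_shingle_filter", "my_minhash_filter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {"type": "text", "analyzer": "my_minhash_analyzer"}
        }
    },
}
```

I create the index with this body via the Python client (es.indices.create).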
But I have been seeing strange results on a test dataset of a little over 3,000 documents.
Once the filter is applied at indexing, I search for near-duplicates by running a "more_like_this" query for each document ID, specifying the text field with the custom MinHash analyzer. I then cluster the intersecting query results into "duplicate groups" via transitive closure, and send updates to add a "duplicate_group_id" field to each document.
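Roughly, the query-and-cluster step looks like this (index/field names and the MLT parameters are simplified; the grouping uses union-find to take the transitive closure of the pairwise matches):

```python
from collections import defaultdict

# Per-document near-duplicate search: a more_like_this query that uses the
# already-indexed document as the "like" text (names simplified).
def mlt_query(doc_id, index="texts"):
    return {
        "query": {
            "more_like_this": {
                "fields": ["text"],
                "like": [{"_index": index, "_id": doc_id}],
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }

# Transitive closure over the pairwise matches via union-find: if A matches B
# and B matches C, all three land in one duplicate group.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def group_duplicates(matches):
    """matches: iterable of (doc_id, matched_id) pairs from the MLT queries."""
    uf = UnionFind()
    for a, b in matches:
        uf.union(a, b)
    groups = defaultdict(set)
    for doc in list(uf.parent):
        groups[uf.find(doc)].add(doc)
    return list(groups.values())
```

For each hit above a chosen score I record the (query_id, hit_id) pair, then feed all pairs to group_duplicates.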
What I find strange is that if I run this process again a few times against the same index, I get very different, and often better, results. The results are never very good after the first pass on an index; in particular, smaller "duplicate groups" of fewer than 5 records are never discovered until the second or third pass of querying and assigning the updated groups. I also have a separate LSH implementation in PySpark, and the Elasticsearch results often don't come close to it until the second or third pass.
Presumably the update process that assigns the "duplicate_group_id" values is having some effect, because when I skip that step, or assign the field value by uploading the documents to a completely new index, the results never change, and can even regress to those of a first-pass iteration.
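For reference, the update step is just a batch of partial-document updates, along these lines (a sketch with simplified names; I build the actions and hand them to the bulk helper from the Python client, elasticsearch.helpers.bulk):

```python
# Build bulk partial-update actions that stamp each document with the ID of
# its duplicate group (index name simplified). These actions are then passed
# to elasticsearch.helpers.bulk(es, actions).
def group_update_actions(groups, index="texts"):
    """groups: list of sets of document IDs, one set per duplicate group."""
    return [
        {
            "_op_type": "update",
            "_index": index,
            "_id": doc_id,
            "doc": {"duplicate_group_id": group_id},
        }
        for group_id, members in enumerate(groups)
        for doc_id in sorted(members)
    ]
```

Nothing else touches the index between passes, which is why the change in results surprises me.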
Can anyone help me understand why this is happening, or what I might be doing wrong? Also, is there a way to score the MinHashes with Jaccard similarity when searching with the "more_like_this" query, so that I can apply a similarity threshold to improve the results?