I am working on near-duplicate detection for short texts (social media posts, reviews, snippetized articles, etc.), and I have been trying to configure the "min_hash" token filter per the documentation.
I set the parameters similarly to the documentation's example, but reduced "bucket_count" to 64 since the texts are fairly short. I have Elasticsearch 7.16 running locally in a Docker container, which I connect to by IP from the Python client.
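To be concrete, here is a sketch of the settings I mean (field and analyzer names are simplified placeholders, not my real ones); it follows the shingle + min_hash pattern from the docs, with "bucket_count" lowered to 64:

```python
# Sketch of the index settings (names simplified). Based on the min_hash
# example in the Elasticsearch docs, with bucket_count reduced to 64
# because the texts are short.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type": "shingle",
                    "min_shingle_size": 5,
                    "max_shingle_size": 5,
                    "output_unigrams": False,
                },
                "my_minhash_filter": {
                    "type": "min_hash",
                    "hash_count": 1,
                    "bucket_count": 64,  # reduced from the documented 512
                    "hash_set_size": 1,
                    "with_rotation": True,
                },
            },
            "analyzer": {
                "my_minhash_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["my_shingle_filter", "my_minhash_filter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {"type": "text", "analyzer": "my_minhash_analyzer"}
        }
    },
}
```

I create the index with this body via the Python client (es.indices.create).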
But I have been seeing strange results on a test dataset of a little over 3,000 documents.
Once the filter is applied at indexing, I search for near-duplicates by running a "more_like_this" query for each document ID, specifying the text field with the custom MinHash analyzer. I then cluster the intersecting query results into "duplicate groups" via transitive closure, and send updates to add a "duplicate_group_id" field to each document.
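Roughly, the query-and-cluster step looks like this (index/field names and the MLT parameters are simplified; the grouping uses union-find to take the transitive closure of the pairwise matches):

```python
from collections import defaultdict

# Per-document near-duplicate search: a more_like_this query that uses the
# already-indexed document as the "like" text (names simplified).
def mlt_query(doc_id, index="texts"):
    return {
        "query": {
            "more_like_this": {
                "fields": ["text"],
                "like": [{"_index": index, "_id": doc_id}],
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }

# Transitive closure over the pairwise matches via union-find: if A matches B
# and B matches C, all three land in one duplicate group.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def group_duplicates(matches):
    """matches: iterable of (doc_id, matched_id) pairs from the MLT queries."""
    uf = UnionFind()
    for a, b in matches:
        uf.union(a, b)
    groups = defaultdict(set)
    for doc in list(uf.parent):
        groups[uf.find(doc)].add(doc)
    return list(groups.values())
```

For each hit above a chosen score I record the (query_id, hit_id) pair, then feed all pairs to group_duplicates.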
What I find strange is that if I run this process again a few times against the same index, I get very different, and often better, results. The results are never very good after the first pass on an index; in particular, smaller "duplicate groups" of fewer than 5 records are never discovered until the second or third pass of querying and assigning the updated groups. I also have a separate LSH implementation in PySpark, and the Elasticsearch results often don't come close to it until the second or third pass.
Presumably the update process that assigns the "duplicate_group_id" values is having some effect, because when I skip that step, or assign the field value by uploading the documents to a completely new index, the results never change, and can even regress to those of a first-pass iteration.
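For reference, the update step is just a batch of partial-document updates, along these lines (a sketch with simplified names; I build the actions and hand them to the bulk helper from the Python client, elasticsearch.helpers.bulk):

```python
# Build bulk partial-update actions that stamp each document with the ID of
# its duplicate group (index name simplified). These actions are then passed
# to elasticsearch.helpers.bulk(es, actions).
def group_update_actions(groups, index="texts"):
    """groups: list of sets of document IDs, one set per duplicate group."""
    return [
        {
            "_op_type": "update",
            "_index": index,
            "_id": doc_id,
            "doc": {"duplicate_group_id": group_id},
        }
        for group_id, members in enumerate(groups)
        for doc_id in sorted(members)
    ]
```

Nothing else touches the index between passes, which is why the change in results surprises me.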
Can anyone help me understand why this is happening, or what I might be doing wrong? Also, is there a way to score the MinHashes with Jaccard similarity when searching with the "more_like_this" query, so that I can apply a similarity threshold to improve the results?