Finding similarity between docs to avoid data duplication by abusers

(Joao) #1

Hi all, I have a task that is simple to explain, and I'm looking into Elasticsearch to see if it is the right tool.

The task

In a marketplace platform, some (ab)users post the (text) content they are advertising multiple times a day to get more visibility. They make small modifications, such as inserting short random strings in the middle of the text, to avoid exact matching.

Given a document, I want to find the maximum "similarity score" (as a percentage) between that document and all other documents stored in the datastore, so that I can flag documents with a score above a threshold for human review.
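
To make the kind of score concrete, here is a minimal sketch in plain Python, outside Elasticsearch. It assumes one possible measure, Jaccard similarity over 3-word shingles; that choice, the shingle size, and the names `shingles`, `jaccard` and `max_similarity` are only illustrative, not something Elasticsearch provides.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient between two sets, in the range 0..1."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_similarity(doc, others, k=3):
    """Highest Jaccard score of doc against every document in others."""
    target = shingles(doc, k)
    return max((jaccard(target, shingles(o, k)) for o in others), default=0.0)


if __name__ == "__main__":
    stored = [
        "selling a red mountain bike in great condition barely used",
        "two bedroom apartment for rent close to the city centre",
    ]
    # Same ad with a random string inserted in the middle.
    new_doc = "selling a red mountain bike xk3f in great condition barely used"
    score = max_similarity(new_doc, stored)
    print(f"max similarity: {score:.0%}")
```

This compares the new document against every stored one, which would not scale to a large datastore; it is only meant to show the coefficient between 0 and 1 that I would like to get back from a query.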

The question

Is there a way to obtain that maximum score (as a percentage or a coefficient between 0 and 1) using an Elasticsearch query?

I have already looked into using a More Like This query, but it does not seem precise enough for this task. I also read about fuzzy matching, but with an edit distance limited to 2 it can't do the job for me.

Thanks in advance.