Finding similarity between docs to avoid data duplication by abusers

es_noobie · May 16, 2019, 3:00pm

Hi all, I have a task that is simple to explain, and I'm researching Elasticsearch to see if it is the right tool.

The task

In a marketplace platform, some (ab)users insert the (text) content they are advertising multiple times a day to get more visibility. They make modifications, like inserting small random strings in the middle of the text, to avoid exact matching.

What I want to do is, given a document, find the max "similarity score" (as a percentage) of that document vs. all other documents stored in the datastore, so that I can send documents with a score above a threshold to human review.

The question

Is there a way I can obtain that max score (as a percentage or a coefficient between 0 and 1) using an Elasticsearch query?

I have already looked into using a More Like This query, but it seems like it's not precise enough for this task. I also read about fuzzy matching, but with an edit distance limited to 2 it can't do the job for me.

Thanks in advance

balazs · May 24, 2019, 9:47am

Hi Joao,

In addition to the possibilities you've already mentioned, another option to consider for finding similar text is text embeddings: generate a vector for the new content string and score it against the vectors stored for the corpus, using cosine similarity script scoring on vector fields (not yet released). Given "they are advertising multiple times a day", I imagine the query could also be heavily restricted to just the current day's documents.
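To make the scoring idea concrete, here is a minimal sketch of the math outside Elasticsearch. It assumes you already have an embedding vector per document from some model of your choice; the toy vectors, the `max_similarity` helper, and the 0.9 threshold are all hypothetical, purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def max_similarity(query_vec, corpus_vecs):
    """Return (doc_id, score) of the stored vector closest to query_vec."""
    best_id, best_score = None, -1.0
    for doc_id, vec in corpus_vecs.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id, best_score


# Toy vectors standing in for real embeddings (assumption).
corpus = {"ad-1": [1.0, 0.0, 0.0], "ad-2": [0.0, 1.0, 0.0]}
best_id, best_score = max_similarity([2.0, 0.0, 0.0], corpus)

# Hypothetical review threshold; tune it against your own data.
needs_review = best_score >= 0.9
```

Cosine similarity ignores vector length, so a document and a scaled copy of its vector score the same. For embeddings with non-negative components the score falls between 0 and 1, which matches the coefficient you asked for.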

You may also want to consider analyzing the text in a way that discards emojis, special characters, or any other elements that submitters are using to avoid exact matching. That way, matching "Awesome Ugly ❤️Sweater❤️ $$$" to "Awesome Ugly Sweater" becomes much easier. In fact, if there is a strong match when using this method and the raw text also contains such characters, it may be an even stronger indication of abusive behavior. This is a challenging problem though: if multiple people are legitimately trying to advertise something like concert tickets, they may all use almost the same terms to describe the tickets (e.g. Tickets- <musician name> <venue> <date>) and may deliberately add emojis or other special characters to call attention to their advertisement, and this wouldn't be abusive behavior.
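As a rough illustration of that normalization step, here is a small sketch done on the application side rather than with an Elasticsearch analyzer. The `normalize` helper is hypothetical, and replacing every non-alphanumeric character with a space is just one possible choice.

```python
def normalize(text):
    """Lowercase, drop emojis/punctuation/symbols, and collapse whitespace."""
    # Keep letters, digits and whitespace; turn everything else into a space
    # so that words glued together by an emoji are still separated.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(cleaned.lower().split())


raw_a = "Awesome Ugly ❤️Sweater❤️ $$$"
raw_b = "Awesome Ugly Sweater"

same_after_cleanup = normalize(raw_a) == normalize(raw_b)

# Identical once cleaned but different in raw form: the decoration may have
# been added to dodge exact matching, so this could be a signal worth reviewing.
decorated_duplicate = same_after_cleanup and raw_a != raw_b
```

Note that this also splits words such as "don't" into two tokens, and it would flag the legitimate concert ticket ads just as readily, so it is probably better treated as one signal for human review than as a verdict.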

