Hi all, I have a task that is simple to explain, and I'm researching Elasticsearch to see if it is the right tool.
The task
On a marketplace platform, some (ab)users post the (text) content they are advertising multiple times a day to get more visibility. They make small modifications, such as inserting short random strings in the middle of the text, to avoid exact matching.
What I want to do is, given a document, find the maximum "similarity score" (as a percentage) of that document against all other documents stored in the datastore, so that I can queue documents scoring above a threshold for human review.
The question
Is there a way I can obtain that max score (as a percentage or a coefficient between 0 and 1) using an Elasticsearch query?
I have already looked into the More Like This query, but it doesn't seem precise enough for this task. I also read about fuzzy matching, but with the edit distance limited to 2 it can't do the job for me.
In addition to the possibilities you've already mentioned, another option for finding similar text is to use text embeddings: generate a vector for the new content string, store vectors for the corpus in vector fields, and score the new vector against them with cosine similarity script scoring (not yet released). Given that "they are advertising multiple times a day", I imagine the query could also be heavily restricted to just the current day's documents.
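To make the cosine-similarity idea concrete, here is a minimal client-side sketch in Python. It does not use any Elasticsearch API. The `embed` function is a hypothetical stand-in (character trigram counts) for whatever real embedding model you would choose, and `max_similarity` is an illustrative name, not something from Elasticsearch. It only shows how a max score between 0 and 1 falls out of comparing one new document against a (small, e.g. same-day) set of candidates.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Hypothetical stand-in for a real embedding model:
    a sparse vector of character n-gram counts."""
    t = text.lower()
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_similarity(new_text, corpus):
    """Highest cosine similarity of new_text against every document in corpus."""
    query_vec = embed(new_text)
    return max((cosine(query_vec, embed(doc)) for doc in corpus), default=0.0)


# Assumed sample data, for illustration only.
corpus = [
    "Awesome ugly sweater for sale, barely worn",
    "Concert tickets for Saturday night",
]
score = max_similarity("Awesome ugly sweater xk3q for sale, barely worn", corpus)
flag_for_review = score > 0.8  # the threshold is an assumption; tune it on real data
```

Because the counts are non-negative, the score always lands between 0 and 1, so it can be reported directly as a coefficient or multiplied by 100 for a percentage.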
You may also want to consider analyzing the text in a way that discards emojis, special characters, or any other elements that submitters use to avoid exact matching. That way, matching Awesome Ugly ❤️Sweater❤️ $$$ to Awesome Ugly Sweater becomes much easier. In fact, if there is a strong match after this cleanup and the raw text also contains such characters, that may be an even stronger indication of abusive behavior. This is a challenging problem, though: if multiple people are legitimately trying to advertise something like concert tickets, they may all use almost the same terms to describe the tickets (e.g. Tickets- <musician name> <venue> <date>) and may deliberately add emojis or other special characters to call attention to their advertisement, and this wouldn't be abusive behavior.
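As a rough illustration of that cleanup step, here is a small Python sketch that runs outside Elasticsearch. Inside Elasticsearch the analogous work would normally be done by the field's analyzer. The function names `normalize` and `has_decoration` are made up for this example, and the rule "keep only letters, digits, and whitespace" is an assumption you would want to adjust for your own content.

```python
import unicodedata


def normalize(text):
    """Lowercase the text and drop emojis, punctuation, and symbols,
    keeping only letters, digits, and single spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(kept.split())


def has_decoration(text):
    """True if the raw text contains any character that normalize() strips."""
    return any(not (ch.isalnum() or ch.isspace()) for ch in text)


raw = "Awesome Ugly ❤️Sweater❤️ $$$"
clean = normalize(raw)
same_after_cleanup = clean == normalize("Awesome Ugly Sweater")
```

Keeping both signals (the match on the cleaned text and whether the raw text carried extra decoration) leaves the final call to the human reviewer, which matters for the legitimate near-duplicate cases such as concert tickets.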