Text similarity with Elasticsearch

Hi everyone,

Is there a way in Elasticsearch to aggregate the values of a particular field by how similar they are to each other?
For example, if I have a field called "userMailbox", I would like to get groups of mailboxes that are similar but not identical.

TIA

Can you give an example?
What is similar?

Hi,
thanks for the response,

example:
Let's say I want to aggregate all the mailboxes in the userMailbox field, so the result should look something like this:

alex1@gmail.com bucket 1
alexa@gmail.com
Aleks@gmail.com
alex@gmail.co
aLex@gmail.com

yanna@aol.com bucket 2
yana@aol.com
Yona@aox.co

admin@aol.com bucket 3
aDmins@aol.com
admin1@aol.com

There's no easy out-of-the-box answer to this.

One approach that might be of interest is to use a phonetic analyzer.

You'd have to do some work to make aggregations work with it, perhaps calling the analyze API in your client code to get the phonetic tokens and concatenating them into a single keyword.
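
A rough sketch of that client-side step is below. It assumes the analysis-phonetic plugin is installed so that a `phonetic` token filter is available. The `metaphone` encoder, the helper names `build_analyze_request` and `phonetic_key`, and the sample response are illustrative assumptions; the token values in the sample are placeholders, not real encoder output.

```python
def build_analyze_request(text, encoder="metaphone"):
    """Build a body for POST /_analyze with an inline phonetic token filter.

    Assumes the analysis-phonetic plugin is installed on the cluster.
    """
    return {
        "tokenizer": "standard",
        "filter": ["lowercase", {"type": "phonetic", "encoder": encoder}],
        "text": text,
    }


def phonetic_key(analyze_response, separator="_"):
    """Join the tokens of an _analyze response into a single keyword value."""
    tokens = [t["token"] for t in analyze_response.get("tokens", [])]
    return separator.join(tokens)


# Hypothetical response: only the shape matches what _analyze returns,
# the token values are made up for illustration.
sample_response = {
    "tokens": [
        {"token": "ALKS", "position": 0},
        {"token": "KML", "position": 1},
    ]
}

request_body = build_analyze_request("alex1@gmail.com")
key = phonetic_key(sample_response)
```

The resulting key could then be indexed into a separate keyword field (any name you like) and used in a terms aggregation, so that mailboxes sharing a phonetic key land in the same bucket.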

Treating gmail.co as equal to gmail.com is probably custom normalisation you'd have to do in your client code.
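
As a minimal sketch of that kind of normalisation: the `DOMAIN_FIXES` table and the `normalise_mailbox` name are made up for illustration, and the list of typos would have to come from your own data.

```python
# Hypothetical lookup of known domain typos; extend it from your own data.
DOMAIN_FIXES = {
    "gmail.co": "gmail.com",
}


def normalise_mailbox(address):
    """Lowercase an address and repair known domain typos."""
    cleaned = address.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        # No "@" found, so there is no domain to repair.
        return cleaned
    return f"{local}@{DOMAIN_FIXES.get(domain, domain)}"
```

With this, `alex@gmail.co` and `aLex@gmail.com` both normalise to `alex@gmail.com`, while domains that are not in the table are only lowercased.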

Ingest pipelines can help wrap up some of the content processing logic if you don't want to stand up a custom data-preparation process.
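
For illustration, a pipeline body along those lines might look like the sketch below, written as a Python dict so it can be sent with whatever client you use. The `userMailbox_normalised` target field and the single gmail.co rule are assumptions for the example, not a recommended rule set.

```python
import json

# Hypothetical body for PUT _ingest/pipeline/<your-pipeline-name>.
mailbox_pipeline = {
    "description": "Normalise userMailbox before indexing (illustrative)",
    "processors": [
        # Copy a lowercased version into a separate field.
        {
            "lowercase": {
                "field": "userMailbox",
                "target_field": "userMailbox_normalised",
            }
        },
        # Repair one known domain typo at the end of the address.
        {
            "gsub": {
                "field": "userMailbox_normalised",
                "pattern": "@gmail\\.co$",
                "replacement": "@gmail.com",
            }
        },
    ],
}

pipeline_json = json.dumps(mailbox_pipeline, indent=2)
```

The original `userMailbox` value is left untouched, so aggregations can run on the normalised field while searches still see what the user actually typed.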

Thank you very much Mark,
As I dig deeper into the subject, it seems to me that it would be better to do the text clustering outside of Elastic...
Do you have any advice you can give me on this?
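
One possible starting point for grouping outside Elasticsearch, as a minimal sketch using only the Python standard library. The greedy strategy, the `group_similar` name, and the 0.8 threshold are arbitrary assumptions for illustration; a real data set would likely need tuning or a proper clustering library.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar(addresses, threshold=0.8):
    """Greedy single-pass grouping.

    Each address joins the first group whose first member is similar
    enough; otherwise it starts a new group.
    """
    groups = []
    for address in addresses:
        for group in groups:
            if similarity(group[0], address) >= threshold:
                group.append(address)
                break
        else:
            groups.append([address])
    return groups


groups = group_similar(["alex@gmail.com", "alexa@gmail.com", "yana@aol.com"])
```

Because every address is compared only with the first member of each group, the result depends on input order, and the cost grows with the number of groups.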
