Group documents by similarity using ELSER

Anton_Dambrouski · September 15, 2023, 10:16am

Thanks for the answer!
It helped me a lot, but I'm wondering whether we can use the generated ELSER tokens in an aggregation.
What I'm trying to achieve is to group documents with similar names using semantic search.

In your example we run a semantic search with a specific text and then aggregate over the results of that query. I'm not sure whether it's possible to group all existing documents with a single query.
I guess a group that contains only one document can be skipped.

Let's imagine that we have the same structure as in the question, but with 10k records containing different company names. There will be duplicates, and I want to find all possible duplicates so I can dedup or merge them.

So the output could look like this:

{
  "Group 1": [
    { "name": "Apple", "price": 1234, "nameTokens": <tokens generated by AI> },
    { "name": "Apple Inc.", "price": 654, "nameTokens": <tokens generated by AI> }
  ],
  "Group 2": [
    { "name": "VentionCloud", "price": 73, "nameTokens": <tokens generated by AI> },
    { "name": "vention cloud", "price": 6534, "nameTokens": <tokens generated by AI> },
    { "name": "ventionclud inc.", "price": 1434, "nameTokens": <tokens generated by AI> }
  ],

  ...

  "Group N": [
    { "name": "companyNameN", "price": N, "nameTokens": <tokens generated by AI> },
    { "name": "companyNameN", "price": N, "nameTokens": <tokens generated by AI> }
  ]
}
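
To make the desired grouping concrete, here is a rough client-side sketch in Python. It is not an Elasticsearch aggregation. It assumes the `nameTokens` of every document have already been fetched from the index as plain token-to-weight maps, and it links any two documents whose token vectors have a cosine similarity above a threshold. The helper names (`cosine`, `group_similar`) and the 0.5 threshold are made up for illustration and are not part of any Elastic API.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse token -> weight maps."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_similar(docs, threshold=0.5):
    """Group docs whose nameTokens are similar; single-document groups are skipped.

    Assumption: each doc is a dict with a "nameTokens" token -> weight map.
    The threshold is an arbitrary illustrative value that would need tuning.
    """
    parent = list(range(len(docs)))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Compare every pair and merge the two sets when they are similar enough.
    for i, j in combinations(range(len(docs)), 2):
        if cosine(docs[i]["nameTokens"], docs[j]["nameTokens"]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, doc in enumerate(docs):
        groups.setdefault(find(i), []).append(doc)
    return [members for members in groups.values() if len(members) > 1]
```

This compares every pair of documents, so for 10k records it performs roughly 50 million comparisons. That is only meant to show the expected result, not how it should be computed inside Elasticsearch.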