Group documents by similarity using ELSER

Hello,
Is it possible to use ML tokens generated by ELSER (Elastic Learned Sparse EncodeR) to group documents?
Let's imagine I have the following list of documents:

[
  { "name": "Apple", "price": 1234, "nameTokens": <tokens generated by AI> },
  { "name": "Apple Inc.", "price": 654, "nameTokens": <tokens generated by AI> },
  { "name": "VentionCloud", "price": 73, "nameTokens": <tokens generated by AI> },
  { "name": "vention cloud", "price": 6534, "nameTokens": <tokens generated by AI> },
  { "name": "ventionclud inc.", "price": 1434, "nameTokens": <tokens generated by AI> }
]

The documents were created via an ingest pipeline in which ELSER generated the tokens from the 'name' field (a sketch of such a pipeline follows the query below). I know that I can run a semantic search using ELSER:
GET my-index/_search
{
   "query":{
      "text_expansion":{
         "ml.tokens":{
            "model_id":".elser_model_1",
            "model_text":"How to avoid muscle soreness after running?"
         }
      }
   }
}
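
For completeness, the pipeline I mean looks roughly like the standard ELSER inference-processor setup; the pipeline name and field_map below are illustrative, not my exact configuration:

PUT _ingest/pipeline/company-name-elser
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_1",
        "target_field": "ml",
        "field_map": {
          "name": "text_field"
        },
        "inference_config": {
          "text_expansion": {
            "results_field": "tokens"
          }
        }
      }
    }
  ]
}

That writes the expansion into ml.tokens (mapped as rank_features), which is the field the text_expansion query above searches.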

But I want it to return similar documents grouped together, let's say the top 7 by similarity score.
The desired output would be:

{
  "Group 1": [
    { "name": "Apple", "price": 1234, "nameTokens": <tokens generated by AI> },
    { "name": "Apple Inc.", "price": 654, "nameTokens": <tokens generated by AI> }
  ],
  "Group 2": [
    { "name": "VentionCloud", "price": 73, "nameTokens": <tokens generated by AI> },
    { "name": "vention cloud", "price": 6534, "nameTokens": <tokens generated by AI> },
    { "name": "ventionclud inc.", "price": 1434, "nameTokens": <tokens generated by AI> }
  ]
}

I guess I need to use aggregations here, but I couldn't find any info about using ELSER in aggregations.

Hi Anton,

Thanks for using ELSER.

Have you tried the combination of histogram and top hits aggregations on the text_expansion search results? Something like this:

GET my-index/_search
{
  "query": {
    "text_expansion": {
      "ml.tokens": {
        "model_id": ".elser_model_1",
        "model_text": "How to avoid muscle soreness after running?"
      }
    }
  },
  "aggs": {
    "histogram_by_score": {
      "histogram": {
        "script": "_score",
        "interval": 5,
        "min_doc_count": 1
      },
      "aggs": {
        "top_7_documents": {
          "top_hits": {
            "_source": [
              "name",
              "price",
              "nameTokens"
            ],
            "size": 7
          }
        }
      }
    }
  }
}

It should return the documents grouped into buckets of fixed _score interval (5), with up to the top 7 documents per bucket.
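
Each bucket then acts as one group. The aggregations section of the response comes back in roughly this shape (abbreviated, with made-up bucket keys, just to illustrate):

{
  "aggregations": {
    "histogram_by_score": {
      "buckets": [
        {
          "key": 10.0,
          "doc_count": 2,
          "top_7_documents": {
            "hits": {
              "hits": [
                { "_source": { "name": "Apple", "price": 1234 } },
                { "_source": { "name": "Apple Inc.", "price": 654 } }
              ]
            }
          }
        },
        {
          "key": 5.0,
          "doc_count": 3,
          "top_7_documents": { "hits": { "hits": [ ... ] } }
        }
      ]
    }
  }
}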

You can also use a range aggregation plus a top_hits aggregation if you don't want to use a fixed interval.
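
For example, something along these lines, mirroring the script approach above; the score boundaries are only illustrative and would need tuning for your data:

GET my-index/_search
{
  "query": {
    "text_expansion": {
      "ml.tokens": {
        "model_id": ".elser_model_1",
        "model_text": "How to avoid muscle soreness after running?"
      }
    }
  },
  "aggs": {
    "ranges_by_score": {
      "range": {
        "script": "_score",
        "ranges": [
          { "to": 5 },
          { "from": 5, "to": 10 },
          { "from": 10 }
        ]
      },
      "aggs": {
        "top_7_documents": {
          "top_hits": {
            "_source": [ "name", "price", "nameTokens" ],
            "size": 7
          }
        }
      }
    }
  }
}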


Thanks for the answer!
It helped me a lot, but I'm wondering if we can use the generated ELSER tokens in an aggregation?
What I'm trying to achieve is to group documents with similar names using semantic search.

In your example we run a semantic search with a specific text and then aggregate on the results of that query. I'm not sure if it's possible to group all existing documents with one query.
I guess that if a group contains only one document we can skip it.

Let's imagine we have the same structure as in the question, but with 10k records containing different company names. There will be duplicates, and I want to find all possible duplicates so I can dedup or merge them.

So, the output could be like that:

{
  "Group 1": [
    { "name": "Apple", "price": 1234, "nameTokens": <tokens generated by AI> },
    { "name": "Apple Inc.", "price": 654, "nameTokens": <tokens generated by AI> }
  ],
  "Group 2": [
    { "name": "VentionCloud", "price": 73, "nameTokens": <tokens generated by AI> },
    { "name": "vention cloud", "price": 6534, "nameTokens": <tokens generated by AI> },
    { "name": "ventionclud inc.", "price": 1434, "nameTokens": <tokens generated by AI> }
  ],

  ...

  "Group N": [
    { "name": "companyNameN", "price": N, "nameTokens": <tokens generated by AI> },
    { "name": "companyNameN", "price": N, "nameTokens": <tokens generated by AI> }
  ]
}

After some experiments I found out that @wei.wang's response was fully correct. The idea behind the query is that we run a query using the ELSER ML model and get back an array of companies with similarity scores.

"text_expansion": {
      "ml.tokens": {
        "model_id": ".elser_model_1",
        "model_text": "How to avoid muscle soreness after running?"
      }
    }

Similar documents will have scores closest to each other. Once we have the scores, we run the aggregation.

"histogram": {
        "script": "_score",
        "interval": 5,
        "min_doc_count": 1
      }

Where "interval" is a step for each bucket. The less this value the more precise will be the result. Finally we return top 7 documents based on several properties of documents.

"top_hits": {
            "_source": [
              "name",
              "price",
              "nameTokens"
            ]
