Buckets of documents grouped by term frequency

NickW · October 14, 2016, 1:31am

I want to segment Elasticsearch results into buckets so that similar documents (those sharing the most terms on an analyzed field) are grouped together in the results. I'm not sure how to build aggregation buckets of individual documents this way.

Here's the basic mapping:

PUT movies
{
  "mappings": {
    "movie": { 
      "properties": { 
        "id":    { "type": "long" }, 
        "title": { "type" : "text" }
      }
    }
  }
}

Now, for example, if I query for hunger, the results should be grouped into buckets of matching documents that share the most terms:

{
    "buckets": {
        "1": [
            {
                "title": "The Hunger Games"
            },
            {
                "title": "The Hunger Games: Mockingjay"
            },
            {
                "title": "The Hunger Games: Catching Fire"
            }
        ],
        "2": [
            {
                "title": "Aqua Teen Hunger Force"
            },
            {
                "title": "Force of Hunger"
            }
        ],
        "3": [
            {
                "title": "Hunger Pain"
            }
        ],
        :
        :
        :
    }
}

In the above example, documents are placed in the same bucket when they share at least two terms. Matching titles that share no other terms with the rest of the results are still included, each in its own bucket (e.g. bucket #3).
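
To make the grouping rule concrete, here is a rough client-side sketch in Python, not something Elasticsearch does for me. It assumes a simple lowercase/split tokenizer as a stand-in for the real analyzer, and it assumes grouping is transitive (if A shares two terms with B and B with C, all three land in one bucket). The names tokenize, bucket_titles and min_shared are made up for illustration.

```python
import re
from itertools import combinations


def tokenize(title):
    # Assumption: lowercase + split on non-word characters, as a rough
    # approximation of what an analyzer would produce for the field.
    return set(re.findall(r"\w+", title.lower()))


def bucket_titles(titles, min_shared=2):
    """Group titles that share at least `min_shared` terms (transitively)."""
    tokens = [tokenize(t) for t in titles]
    parent = list(range(len(titles)))  # union-find forest over title indexes

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(range(len(titles)), 2):
        if len(tokens[a] & tokens[b]) >= min_shared:
            ra, rb = find(a), find(b)
            parent[max(ra, rb)] = min(ra, rb)  # keep the lowest index as root

    buckets = {}
    for i, title in enumerate(titles):
        buckets.setdefault(find(i), []).append(title)
    return list(buckets.values())
```

Run over the six titles above, this should give the same three buckets. It compares every pair of titles, so it only makes sense on a single page of results.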

Any suggestions are appreciated.

jpountz · October 14, 2016, 2:54pm

I don't think you can do it easily/efficiently with aggregations. However, maybe you could use function_score with one filter per term to compute a score equal to the number of matching terms, and sort by it. Then on the client side you would get documents in order and could look at the score to know how many terms matched.
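
A minimal sketch of what that could look like, written as Python that builds the request body and then groups the hits on the client. It assumes the title field from the mapping above and that the query terms are already lowercased to match the indexed tokens; build_query and group_hits_by_score are hypothetical helper names, not part of any client library.

```python
from collections import defaultdict


def build_query(terms, field="title"):
    """Build a function_score body that scores each hit by matched-term count."""
    # One filter function per term, each contributing a weight of 1.
    functions = [{"filter": {"term": {field: t}}, "weight": 1} for t in terms]
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: " ".join(terms)}},
                "functions": functions,
                "score_mode": "sum",      # add up the weights of matching filters
                "boost_mode": "replace",  # ignore the relevance score of the match query
            }
        }
    }


def group_hits_by_score(hits):
    """Group hit titles by score; `hits` is response["hits"]["hits"]."""
    groups = defaultdict(list)
    for hit in hits:
        groups[round(hit["_score"])].append(hit["_source"][field_name(hit)])
    # Highest number of matched terms first.
    return dict(sorted(groups.items(), reverse=True))


def field_name(hit):
    # Assumption: every document has a "title" field, as in the mapping above.
    return "title"
```

The body returned by build_query(["hunger", "games"]) can be sent as a normal search request. Since results are sorted by score by default, documents matching both terms come back first, and group_hits_by_score only makes the per-score grouping explicit.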

NickW · October 15, 2016, 6:20am

Thanks @jpountz. I read about function_score and it looks like a script_score might be useful, but I'm not sure how to compute the score (I suppose based on the field's term vectors or the TF-IDF of the filtered results). Any ideas?
