Buckets of documents grouped by term frequency


#1

I want to segment Elasticsearch results into buckets so that similar documents (those sharing the most matching terms on an analyzed field) are grouped together in the results. I'm not sure how to build aggregated buckets of individual documents this way.

Here's the basic mapping:

PUT movies
{
  "mappings": {
    "movie": { 
      "properties": { 
        "id":    { "type": "long" }, 
        "title": { "type" : "text" }
      }
    }
  }
}

Now, for example, if a query is run for hunger, the results should be grouped into buckets of matching documents that share the most terms:

{
    "buckets": {
        "1": [
            {
                "title": "The Hunger Games"
            },
            {
                "title": "The Hunger Games: Mockingjay"
            },
            {
                "title": "The Hunger Games: Catching Fire"
            }
        ],
        "2": [
            {
                "title": "Aqua Teen Hunger Force"
            },
            {
                "title": "Force of Hunger"
            }
        ],
        "3": [
            {
                "title": "Hunger Pain"
            }
        ],
        :
        :
        :
    }
}

In the above example, similar documents are grouped into separate buckets based on at least two matching terms. All matching titles without similar terms are still included in the results, in separate buckets of their own (e.g. bucket #3).

Any suggestions are appreciated.


(Adrien Grand) #2

I don't think you can do this easily or efficiently with aggregations. However, you could use function_score with one filter per term to compute a score equal to the number of matching terms, and sort by it. On the client side you would then receive the documents in score order and could read each document's score to see how many terms matched.
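
As a rough illustration of the suggestion above, the Python sketch below builds a function_score request body with one weight-1 filter per term. With score_mode set to sum and boost_mode set to replace, each hit's _score should equal the number of terms it matched. A second helper groups hits by that score on the client. The helper names, the title field, and the example terms are assumptions made for illustration, and actually sending the request (with whichever client you use) is left out.

```python
import json


def build_term_count_query(field, terms):
    """Build a request body whose _score is meant to be the count of matched terms.

    `field` and `terms` are illustrative inputs, not names from the thread.
    """
    # One function per term: a hit that matches the filter contributes weight 1.
    per_term = [{"filter": {"match": {field: t}}, "weight": 1} for t in terms]
    return {
        "query": {
            "function_score": {
                # Base query: keep documents matching at least one term.
                "query": {"match": {field: " ".join(terms)}},
                "functions": per_term,
                "score_mode": "sum",      # add up the weights of matching filters
                "boost_mode": "replace",  # ignore the relevance score of the base query
            }
        }
    }


def group_hits_by_score(response, field="title"):
    """Group hits from a search response by their (integer) score, client side."""
    buckets = {}
    for hit in response["hits"]["hits"]:
        matched = int(round(hit["_score"]))
        buckets.setdefault(matched, []).append(hit["_source"][field])
    return buckets


if __name__ == "__main__":
    # Print the body that would be sent to the movies index's _search endpoint.
    print(json.dumps(build_term_count_query("title", ["hunger", "games"]), indent=2))
```

Note that this counts how many of the query's terms each document matches. It does not measure similarity between documents, so it only approximates the bucketing described in post #1.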


#3

Thanks @jpountz. I read about function_score, and it looks like script_score might be useful, but I'm not sure how to compute the score (I suppose from the field's term vectors or the TF-IDF of the filtered results). Any ideas?

