Scoring words and fetching docs that reach a minimum score


(Dekel) #1

I'm looking for a way to give words a score (a different score for each word) and find docs that reach a minimum score (based on which words they contain and those words' scores).

The idea is that I might have words with a high score (wordA, score=5), and I want all docs that have wordA, but I also have words with a lower score (wordB, wordC, wordD, score=2 each). If I want docs that reach a minimum score of 4, I would like to get the docs that have wordA or any combination of 2 of (wordB, wordC, wordD).
The same goes for a minimum score of 6: the result should be docs containing wordA and one of (wordB, wordC, wordD), or docs containing all 3 of (wordB, wordC, wordD).

I tried boosting words in combination with minimum_should_match, but it doesn't really do the trick. Any ideas on how I can do that?


(Zachary Tong) #2

If you need explicit scoring (e.g. you don't want TF-IDF derived scores), you can use a function_score query to set your own custom scoring based on lists of terms and their weightings. For example:


POST test/test
{
  "title": "wordA wordB wordC"
}

POST test/test
{
  "title": "wordD"
}

POST test/test
{
  "title": "wordB wordC"
}

POST test/test
{
  "title": "wordC"
} 

GET /test/_search
{
  "query": {
    "function_score": {
      "functions": [
        {
          "filter": {
            "terms": {
              "title": ["worda"]
            }
          },
          "weight": 5
        },
        {
          "filter": {
            "terms": {
              "title": ["wordd"]
            }
          },
          "weight": 3
        },
        {
          "filter": {
            "terms": {
              "title": [ "wordb"]
            }
          },
          "weight": 1
        },
        {
          "filter": {
            "terms": {
              "title": [ "wordc"]
            }
          },
          "weight": 2
        }
      ],
      "score_mode": "sum"
    }
  },
  "min_score": 3
}

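Each function pairs a terms filter, which lists the tokens that share a given weight, with that weight. The terms are written in lowercase because a terms filter is not analyzed and has to match the tokens as they were indexed (assuming the default standard analyzer, which lowercases them). With score_mode set to sum, the function_score query adds up the weights of every function whose filter matches. Finally, the query sets min_score to 3, which excludes documents that haven't accumulated enough "weight".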
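To make the arithmetic concrete, here is a minimal Python sketch that mimics the summing for the four sample documents. It is not how Elasticsearch computes scores internally. It assumes the base query score is 1.0 (as with the implicit match_all here), that a plain lowercase-and-split is a good enough stand-in for the analyzer, and that min_score keeps documents whose score equals the threshold. The names `WEIGHTS`, `DOCS`, `summed_weight` and `matching_docs` are made up for the illustration.

```python
# Weights from the query above, keyed by the lowercased token.
WEIGHTS = {"worda": 5, "wordd": 3, "wordb": 1, "wordc": 2}

# Titles of the four sample documents indexed above.
DOCS = ["wordA wordB wordC", "wordD", "wordB wordC", "wordC"]


def summed_weight(title, weights):
    """Sum the weight of every distinct weighted token found in the title."""
    # Lowercase + whitespace split is a rough stand-in for the standard analyzer.
    tokens = set(title.lower().split())
    return sum(weight for term, weight in weights.items() if term in tokens)


def matching_docs(docs, weights, min_score):
    """Return (title, score) pairs whose summed weight reaches min_score."""
    scored = [(title, summed_weight(title, weights)) for title in docs]
    return [(title, score) for title, score in scored if score >= min_score]


for title, score in matching_docs(DOCS, WEIGHTS, 3):
    print(score, title)
```

Under those assumptions the first three documents pass (8, 3 and 3) and the "wordC" document, with a sum of 2, is dropped. Swapping in the weights from the original question (wordA=5, the others 2) with a threshold of 4 keeps any document that has wordA or at least two of the other words.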

You could also use the Terms Lookup functionality to index those term lists, rather than specifying them in the query itself.
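As a rough sketch of the Terms Lookup variant, the Python below builds the same search body but points each terms filter at a stored document instead of an inline list. The index name `term_weights`, the document ids `weight_5` etc., and the `terms` field path are all hypothetical; each lookup document is assumed to hold its term list under that field. Older releases that still use mapping types also expect a `type` key inside the lookup, which is left out here.

```python
import json

# Hypothetical lookup documents, one per weight. The document with id
# "weight_5" is assumed to contain {"terms": ["worda"]}, and so on.
LOOKUPS = {"weight_5": 5, "weight_3": 3, "weight_2": 2, "weight_1": 1}


def lookup_function(doc_id, weight, index="term_weights", path="terms", field="title"):
    """Build one function_score entry whose terms filter uses Terms Lookup."""
    return {
        "filter": {"terms": {field: {"index": index, "id": doc_id, "path": path}}},
        "weight": weight,
    }


def build_query(lookups, min_score):
    """Build a search body that sums the weights of all matching filters."""
    return {
        "query": {
            "function_score": {
                "functions": [lookup_function(d, w) for d, w in lookups.items()],
                "score_mode": "sum",
            }
        },
        "min_score": min_score,
    }


print(json.dumps(build_query(LOOKUPS, 3), indent=2))
```

The printed JSON can be sent as the body of the same GET /test/_search request. Changing a word's weight then means editing a lookup document rather than the query.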

