Boosting a field yields bizarre results

Howdy!

I'm trying to boost search results by a status field (it's an enumeration). My query is as follows:

{
  "query" : {
    "bool" : {
      "must" : ...,
      "should" : [
        {
          "term" : {
            "status" : { "value" : "Valid", "boost" : 5 }
          }
        },
        {
          "term" : {
            "status" : { "value" : "Pending", "boost" : 4 }
          }
        },
        {
          "term" : {
            "status" : { "value" : "Expired", "boost" : 3 }
          }
        },
        {
          "term" : {
            "status" : { "value" : "Invalid", "boost" : 2 }
          }
        },
        {
          "term" : {
            "status" : { "value" : "Unknown", "boost" : 1 }
          }
        }
      ]
    }
  }
}

However, this does not sort the results as I would expect. I get the results as follows:

{
  "hits" : {
    "max_score": 2.1227207,
    "hits": [
      {
        "_score": 2.1227207,
        "_source": { "status": "Pending" }
      },
      {
        "_score": 1.5702559,
        "_source": { "status": "Unknown" }
      },
      {
        "_score": 1.4841971,
        "_source": { "status": "Valid" }
      },
      {
        "_score": 1.4767004,
        "_source": { "status": "Expired" }
      }
    ]
  }
}

Obviously I was aiming to have Valid results first, then Pending, ..., and last Unknown.

If I remove the should array of boosted statuses altogether, all the results have the same score (as I would expect).

Am I missing something? Why is this query not working as I would expect? https://www.elastic.co/guide/en/elasticsearch/guide/current/_boosting_query_clauses.html would IMO suggest that this should be possible...

Thank you!

EDIT:

  • the status field is mapped as not_analyzed, which is why I use case-sensitive statuses in the query

Someone in the Elasticsearch IRC channel suggested using the explain API. These are the results, if someone can make some sense out of them.

So this is the explain result for a document with status: "Valid" (which I assumed would be sorted first):

And this is the Pending result, which gets a higher score than Valid, even though Valid should have a higher boost:

Ok... So I read https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html and I guess the IDF-factor is to blame here.

If I spread the boost values further apart, then it works. Feels kludgy, though... Any better options out there?

Yeah, this can be unintuitive at first. Like you discovered, the IDF can sometimes mess with your scoring, because it is trying to weight the relative rarity of the terms.

If you just don't care about the IDF, the easiest approach is to wrap your term queries in constant_score queries. That will assign a score of 1 if the doc matches, 0 otherwise. TF/IDF is not taken into account. You can also boost them relative to each other for proper ordering. See this for more details: https://www.elastic.co/guide/en/elasticsearch/guide/current/ignoring-tfidf.html
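
As a rough sketch, the should clauses from the question could be wrapped like this. The match_all in must is only a stand-in for the original must clause, which was elided in the question, and the exact constant_score syntax may differ between versions, so check it against the reference docs:

```json
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "should": [
        { "constant_score": { "filter": { "term": { "status": "Valid" } },   "boost": 5 } },
        { "constant_score": { "filter": { "term": { "status": "Pending" } }, "boost": 4 } },
        { "constant_score": { "filter": { "term": { "status": "Expired" } }, "boost": 3 } },
        { "constant_score": { "filter": { "term": { "status": "Invalid" } }, "boost": 2 } },
        { "constant_score": { "filter": { "term": { "status": "Unknown" } }, "boost": 1 } }
      ]
    }
  }
}
```

Each document matches at most one of these clauses, so its contribution depends only on the boost and not on how rare the status is. The absolute scores may still be rescaled by query normalization, so look at the relative order rather than the raw numbers.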

You could also use function_score to set up an array of filters which are each weighted differently. The advantage of function_score is that it is a lot more flexible (you can include many different sub-clauses, mathematical functions, etc.) and can apply non-normalized boosts. The downside is that it's a lot more complicated :)

More details about function_score here: https://www.elastic.co/guide/en/elasticsearch/guide/current/function-score-filters.html and here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
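
A minimal sketch of the weighted-filter variant, assuming a recent version where each function takes a filter and a weight (older releases used a different parameter name). The match_all again stands in for the elided must clause from the question:

```json
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "functions": [
        { "filter": { "term": { "status": "Valid" } },   "weight": 5 },
        { "filter": { "term": { "status": "Pending" } }, "weight": 4 },
        { "filter": { "term": { "status": "Expired" } }, "weight": 3 },
        { "filter": { "term": { "status": "Invalid" } }, "weight": 2 },
        { "filter": { "term": { "status": "Unknown" } }, "weight": 1 }
      ],
      "score_mode": "first",
      "boost_mode": "replace"
    }
  }
}
```

Here score_mode "first" takes the weight of the first matching filter, and boost_mode "replace" discards the score of the inner query so that only the status weight decides the order. If the relevance of the main query should still count, "multiply" or "sum" would be the boost modes to try instead.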

(Note: some of the syntax in the Guide is a bit outdated, since yours truly has been slacking and hasn't gotten everything updated yet :) Verify with the reference docs to make sure the syntax is still valid.)
