Give higher relevancy (sort) to the title which is shorter


(Janaka Bandara) #1

Hi,
Let me explain my scenario.
When I run the following query:

GET /ssl/listings/_search
{
    "query": {
        "match": {
             "title": "Hilton"
         }
      }
}

I get the search results "Hilton Colombo Residences" and "Hilton Colombo".
But I want to give more weight to results that have fewer words in the title field,
so that the output is "Hilton Colombo" first, then "Hilton Colombo Residences".

Does anyone know how to achieve this?

Thanks


(Shane Connelly) #2

All other things being equal, this should already happen by default by virtue of how BM25 works. Here is a simple, reproducible set of REST requests that demonstrates it:

PUT test
{
  "settings": {
    "number_of_shards": 1
  }, 
  "mappings": {
    "listings": {
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
  }
}

POST /test/listings
{
  "title": "Hilton Colombo Residences"
}

POST /test/listings
{
  "title": "Hilton Colombo"
}

GET /test/listings/_search
{
    "query": {
        "match": {
             "title": "Hilton"
         }
      }
}

When you run this (assuming you're on 5.0+, where the default similarity switched from TF/IDF, or "classic", to BM25), you should see the "Hilton Colombo" match come back with a score of 0.19856805 and "Hilton Colombo Residences" come back with a lower score of 0.16853254. If you're using the "classic" TF/IDF-based similarity (< 5.0), you should see the same ordering but different scores (0.70710677 and 0.57735026 respectively).
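To make the length effect concrete, the two BM25 scores quoted above can be reproduced by hand. This is only a sketch under assumptions that are not spelled out in the thread: Lucene's default parameters k_1 = 1.2 and b = 0.75, a single shard holding just the two test documents (N = 2, both containing "hilton", so n = 2), a term frequency of f = 1, and title lengths of 2 and 3 tokens with an average of 2.5.

```latex
\begin{aligned}
\mathrm{idf} &= \ln\!\left(1 + \frac{N - n + 0.5}{n + 0.5}\right)
             = \ln\!\left(1 + \frac{0.5}{2.5}\right) = \ln 1.2 \approx 0.18232 \\[4pt]
\mathrm{score}(d) &= \mathrm{idf} \cdot
   \frac{f\,(k_1 + 1)}{f + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)} \\[4pt]
|d| = 2 \ (\text{Hilton Colombo}): \quad
   & 0.18232 \cdot \frac{2.2}{1 + 1.2\,(0.25 + 0.75 \cdot 2 / 2.5)}
   = 0.18232 \cdot \frac{2.2}{2.02} \approx 0.19857 \\[4pt]
|d| = 3 \ (\text{Hilton Colombo Residences}): \quad
   & 0.18232 \cdot \frac{2.2}{1 + 1.2\,(0.25 + 0.75 \cdot 3 / 2.5)}
   = 0.18232 \cdot \frac{2.2}{2.38} \approx 0.16853
\end{aligned}
```

The only quantity that differs between the two documents is |d| / avgdl, so the shorter title ends up with the higher score.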

There are a few reasons why your numbers, or even the ordering, may differ. One plausible explanation is that your real query is more complex than what you've pasted here, and the other elements of the query affect the score more than this particular part.

Another thing to look at is that Elasticsearch scores documents at the shard level by default and assumes an approximately even distribution of terms across the shards. By default (in versions before 7.0), an index in Elasticsearch has 5 primary shards, so the term distributions in each shard can be skewed, especially when there are only a small number of documents. Generally, once each shard holds a large number of documents with a statistically significant number of occurrences of each term, the per-shard relevance of each term converges, but it can be off if you end up with (weird) custom routing or (again) a small number of documents.

You can add search_type=dfs_query_then_fetch to your query to gather global term statistics before scoring if you like. You may also be interested in the "Relevance is Broken!" page of the Definitive Guide for a bit more information.
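As a rough illustration, the search_type mentioned above is passed as a URL query parameter. The request below reuses the test index from this post and is only a sketch; on a real multi-shard index you would substitute your own index name and query.

```
GET /test/listings/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": {
      "title": "Hilton"
    }
  }
}
```

The extra round trip that collects term statistics from every shard adds some latency, so this is usually more useful for debugging or for small indices than as a default.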

The other thing you should be aware of is the Explain API, which you can use to help debug the scoring.
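One low-friction way to get the same kind of breakdown is to set "explain": true in the search request itself, which attaches a score explanation to every hit without needing a document ID. Note that this is the search-level option rather than the dedicated Explain API endpoint, and the index and type names below are simply the ones from the test setup in this post.

```
GET /test/listings/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": "Hilton"
    }
  }
}
```

Each hit then carries an _explanation section showing how the score was assembled, which makes it easier to compare against the scores listed earlier.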

