Random sorting for matched and unmatched parts of result set


(Eugene Prokopiev) #1

Hi,

I have an index with 25,000 documents that look like this:

GET locations/location/10/_source
{
  "country_code": "US",
  "id": "10",
  "continent_code": "NA",
  "city_name": "New York"
} 

I need to get cities from two countries first, in random order, followed by all other cities, also in random order. For now I can simply show the cities from the two countries first and the other cities after them, using a query with "should" and "minimum_should_match":

POST locations/_search?size=100
{
  "query": {
    "bool": {
      "should": [
        {
          "match_all": {}
        },
        {
          "term": {
            "country_code": "mz"
          }
        },
        {
          "term": {
            "country_code": "mg"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

But the order of the first 100 documents is strange: I see some MG cities first, then some MZ, then some MG again, and PT at the end. Why is the order so strange?

Is it possible to see MG and MZ cities in random order first and then all other cities in random order too? I know about "function_score", but I can't understand how to apply two "random_score" functions, one for each part of the results (matched and unmatched).


(Daniel Mitterdorfer) #2

Hi,

I've tried the following (similar) example which should get you started:

POST /_bulk
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "AT", "city": "Innsbruck"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "AT", "city": "Salzburg"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "DE", "city": "München"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "DE", "city": "Berlin"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "DE", "city": "Hamburg"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "CH", "city": "Bern"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "CH", "city": "Zürich"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "US", "city": "San Francisco"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "US", "city": "New York"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "SE", "city": "Stockholm"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "SE", "city": "Malmö"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "NO", "city": "Oslo"}

In the example we want to sort cities from "AT" and "DE" first, then the rest:

GET /location/type/_search
{
   "query": {
      "function_score": {
         "query": {
            "match_all": {}
         },
         "functions": [
            {
               "filter": {
                  "match": {
                     "country": "AT"
                  }
               },
               "random_score": {},
               "weight": 100
            },
            {
               "filter": {
                  "match": {
                     "country": "DE"
                  }
               },
               "random_score": {},
               "weight": 100
            },
            {
               "filter": {
                  "match_all": {}
               },
               "random_score": {},
               "weight": 5
            }
         ],
         "score_mode": "max",
         "boost_mode": "max"
      }
   }
}

"AT" and "DE" get a random score with a higher weight. All documents get a random score with a lower weight. We take the maximum of both scores so the high-priority countries are ordered first. Note that there is a slim chance that a high-priority country is still ordered below an ordinary one, because the two random score ranges overlap. I guess you could fix this with a script score, but it's up to you to decide whether you can tolerate it.
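To put a rough number on that chance, here is a small sketch in plain Python (not Elasticsearch code). It assumes random_score behaves like an independent uniform draw in [0, 1) and that a document's final score is simply weight times that draw; the helper names are made up for illustration. Under those assumptions a single high-priority/ordinary pair ends up in the wrong order about 2.5% of the time. The sketch also shows one way a script-style score could rule the overlap out: add a fixed per-tier offset to the random part instead of multiplying by a weight.

```python
import random

def weighted_random(weight, rng):
    # Stand-in for random_score multiplied by "weight" (assumed uniform in [0, 1)).
    return weight * rng.random()

def tiered_random(tier, rng):
    # Alternative: integer tier offset plus a random fraction, so tiers cannot overlap.
    return tier + rng.random()

def overlap_rate(score_high, score_low, trials, rng):
    # Fraction of trials in which the high-priority doc scores below the ordinary one.
    below = sum(1 for _ in range(trials) if score_high(rng) < score_low(rng))
    return below / trials

rng = random.Random(42)
weighted = overlap_rate(lambda r: weighted_random(100, r),
                        lambda r: weighted_random(5, r), 100_000, rng)
tiered = overlap_rate(lambda r: tiered_random(2, r),
                      lambda r: tiered_random(1, r), 100_000, rng)
print(f"weighted overlap: {weighted:.3f}")
print(f"tiered overlap:   {tiered:.3f}")
```

With the tiered variant the random part only shuffles documents inside their own tier, which is the behaviour a custom scoring script would need to reproduce.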

Daniel


(Eugene Prokopiev) #3

Thank you, but I can't understand how to avoid the possible duplication with a script score. Can you show me an example?

Suppose "AT" and "DE" must have a higher weight, "NO" must have a lower weight, and all other countries must fall between them. What would the functions look like in this case?


(Eugene Prokopiev) #4

By the way, is it possible to use MLT queries instead of simple match filters? I need to set a higher weight for MLT queries with 'like', a lower weight for MLT queries with 'unlike', and a middle weight for "match_all" (but exclude the results of the MLT queries with 'like' and 'unlike' from "match_all").

I see only one way to do this: first run both MLT queries with 'like' and 'unlike', and then create a bool query with "must_not" and "ids" containing the ids from the MLT query responses.
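Roughly, the second step could be built like this. It is only a sketch of the request body as a Python dict: `like_ids` and `unlike_ids` are placeholders for whatever ids the two MLT searches return, and `build_rest_query` is a made-up name.

```python
import json

def build_rest_query(like_ids, unlike_ids):
    # Exclude every document already returned by either MLT search.
    excluded = sorted(set(like_ids) | set(unlike_ids))
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                "must_not": {"ids": {"values": excluded}},
            }
        }
    }

# Placeholder ids standing in for the two MLT responses.
body = build_rest_query(["10", "12"], ["12", "40"])
print(json.dumps(body, indent=2))
```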

Maybe something clearer is possible?


(Daniel Mitterdorfer) #5

Hi @enp,

I'm not sure I get what you mean. Do you want to use script score in a function score query? If yes, then you just replace the random_score bits with script_score and write your custom scoring script. I also wouldn't really bother with this duplication as I think it's pretty straightforward to understand this way.

It should be something along these lines:

GET /location/type/_search
{
   "query": {
      "function_score": {
         "query": {
            "match_all": {}
         },
         "functions": [
            {
               "filter": {
                  "match": {
                     "country": "AT"
                  }
               },
               "random_score": {},
               "weight": 1000
            },
            {
               "filter": {
                  "match": {
                     "country": "DE"
                  }
               },
               "random_score": {},
               "weight": 1000
            },
            {
               "filter": {
                  "match": {
                     "country": "NO"
                  }
               },
               "random_score": {},
               "weight": 0.001
            },
            {
               "filter": {
                  "match_all": {}
               },
               "random_score": {},
               "weight": 1
            }
         ],
         "score_mode": "first",
         "boost_mode": "sum"
      }
   }
}

I adjusted the weights a bit and changed the score_mode to "first" so the first matching function score is used (otherwise the "NO" case would not be possible).
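To illustrate what "first" changes, here is a small plain-Python model (again not Elasticsearch code) of how the function list is evaluated as far as I understand it: the functions are checked in order and the first one whose filter matches supplies the score. The random part is replaced by a fixed value of 0.5 so the result is reproducible; `FUNCTIONS` and `first_match_score` are names made up for this sketch.

```python
# (filter predicate, weight) pairs in the same order as in the query.
FUNCTIONS = [
    (lambda doc: doc["country"] == "AT", 1000),
    (lambda doc: doc["country"] == "DE", 1000),
    (lambda doc: doc["country"] == "NO", 0.001),
    (lambda doc: True, 1),  # match_all catches everything else
]

def first_match_score(doc, random_value=0.5):
    # score_mode "first": use only the first function whose filter matches.
    for matches, weight in FUNCTIONS:
        if matches(doc):
            return weight * random_value
    return 1.0  # no function matched (cannot happen here because of match_all)

docs = [{"country": c} for c in ("NO", "CH", "DE", "AT", "US")]
ranked = sorted(docs, key=first_match_score, reverse=True)
print([d["country"] for d in ranked])
```

With the default score_mode ("multiply"), a "NO" document would match both its own function and the match_all one, and the two function scores would be multiplied together instead.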

Daniel


(Daniel Mitterdorfer) #6

Hi,

Hmm, you could run a dis max query and have the "like" MLT query, "unlike" MLT query and the match_all query as subqueries with different boost values?
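The idea, sketched in plain Python rather than query DSL: dis_max scores a document with the best of its matching subqueries, plus tie_breaker times the scores of the others. The boosts 3/2/1 and the simplification that each subquery contributes exactly its boost are assumptions for illustration only; real MLT scores vary per document.

```python
def dis_max_score(sub_scores, tie_breaker=0.0):
    # The best matching subquery wins; the rest only count via tie_breaker.
    best = max(sub_scores)
    return best + tie_breaker * (sum(sub_scores) - best)

# Hypothetical boosts: "like" highest, match_all in the middle, "unlike" lowest.
LIKE_BOOST, ALL_BOOST, UNLIKE_BOOST = 3.0, 2.0, 1.0

def doc_score(matches_like, matches_unlike):
    sub_scores = [ALL_BOOST]  # match_all matches every document
    if matches_like:
        sub_scores.append(LIKE_BOOST)
    if matches_unlike:
        sub_scores.append(UNLIKE_BOOST)
    return dis_max_score(sub_scores)

print("like:   ", doc_score(True, False))
print("neither:", doc_score(False, False))
print("unlike: ", doc_score(False, True))
```

In this simplified model an "unlike" document scores the same as an ordinary one, because match_all always matches and dis_max keeps the maximum. So the boosts alone lift the "like" matches; pushing the "unlike" matches down would probably still need them to be excluded from the match_all part.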

Daniel


(Eugene Prokopiev) #7

Thank you, I will try all your recommendations.

