Random sorting for matched and unmatched parts of result set

enp · January 4, 2017, 1:02pm

Hi,

I have an index with 25,000 documents that look like this:

GET locations/location/10/_source
{
  "country_code": "US",
  "id": "10",
  "continent_code": "NA",
  "city_name": "New York"
}

I need to get the cities from two countries first, in random order, followed by all other cities, also in random order. So far I can only show the cities from the two countries first and the remaining cities after them, using a query with "should" and "minimum_should_match":

POST locations/_search?size=100
{
  "query": {
    "bool": {
      "should": [
        {
          "match_all": {}
        },
        {
          "term": {
            "country_code": "mz"
          }
        },
        {
          "term": {
            "country_code": "mg"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

But the order of the first 100 documents is strange: I see some MG cities first, then some MZ, then some MG again, and PT at the end. Why is the order so strange?

Is it possible to get the MG and MZ cities in random order, followed by all other cities in random order as well? I know about "function_score", but I can't figure out how to apply two "random_score" functions, one for each part of the results (matched and unmatched).

danielmitterdorfer · January 5, 2017, 8:50am

Hi,

I've tried the following (similar) example which should get you started:

POST /_bulk
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "AT", "city": "Innsbruck"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "AT", "city": "Salzburg"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "DE", "city": "München"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "DE", "city": "Berlin"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "DE", "city": "Hamburg"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "CH", "city": "Bern"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "CH", "city": "Zürich"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "US", "city": "San Francisco"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "US", "city": "New York"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "SE", "city": "Stockholm"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "SE", "city": "Malmö"}
{ "index" : { "_index" : "location", "_type" : "type" } }
{"country": "NO", "city": "Oslo"}

In the example we want to sort cities from "AT" and "DE" first, then the rest:

GET /location/type/_search
{
   "query": {
      "function_score": {
         "query": {
            "match_all": {}
         },
         "functions": [
            {
               "filter": {
                  "match": {
                     "country": "AT"
                  }
               },
               "random_score": {},
               "weight": 100
            },
            {
               "filter": {
                  "match": {
                     "country": "DE"
                  }
               },
               "random_score": {},
               "weight": 100
            },
            {
               "filter": {
                  "match_all": {}
               },
               "random_score": {},
               "weight": 5
            }
         ],
         "score_mode": "max",
         "boost_mode": "max"
      }
   }
}

"AT" and "DE" get a random score with a higher weight. All documents get a random score with a lower weight. We take the maximum of both scores so the high-priority countries are ordered first. Note that there is a (very) slim chance that a high-priority country is still ordered below an ordinary one. I guess you could fix this with a script score but that's up to you to decide whether you can tolerate that.

Daniel
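
To get a feel for how slim that chance is, here is a minimal simulation sketch in Python. It assumes that random_score draws independent uniform values in [0, 1), that the scores of the matching functions are combined by taking their maximum, and it ignores the score of the match_all query. The function names are hypothetical and exist only for this illustration.

```python
import random


def priority_score(rng, high_weight=100.0, low_weight=5.0):
    # A priority document ("AT" or "DE") matches both its own function and
    # the catch-all function; the larger weighted random value wins.
    return max(high_weight * rng.random(), low_weight * rng.random())


def ordinary_score(rng, low_weight=5.0):
    # An ordinary document only matches the catch-all function.
    return low_weight * rng.random()


def estimate_inversion_rate(trials=100_000, seed=42):
    """Estimate how often a priority document scores below an ordinary one."""
    rng = random.Random(seed)
    inversions = sum(
        priority_score(rng) < ordinary_score(rng) for _ in range(trials)
    )
    return inversions / trials


if __name__ == "__main__":
    print(f"estimated inversion rate: {estimate_inversion_rate():.4f}")
```

Under these assumptions the rate for a single pair of documents works out analytically to (5/100)/3, i.e. roughly 1.7%. Raising the ratio between the two weights shrinks it but does not bring it to zero.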

enp · January 5, 2017, 12:26pm

Thank you, but I can't understand how to avoid the possible duplication with a script score. Can you show me an example?

Suppose "AT" and "DE" must have a higher weight, "NO" must have a lower weight, and all other countries must fall between them. What would the functions look like in this case?

enp · January 5, 2017, 12:45pm

Btw, is it possible to use MLT queries instead of simple match filters? I need to set a higher weight for MLT queries with 'like', a lower weight for MLT queries with 'unlike', and a middle weight for "match_all" (but exclude the results of the MLT queries with 'like' and 'unlike' from "match_all").

I see only one way to do this: first run both MLT queries (with 'like' and 'unlike'), then build a bool query with "must_not" and an "ids" query containing the ids from the MLT query responses.

Maybe something cleaner is possible?
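
For reference, the second step of the two-pass approach described above could be sketched roughly as follows in Python. The helper name and the example ids are hypothetical; the sketch only builds the request body for the "everything else" query and assumes the ids have already been collected from the two MLT responses.

```python
import json


def build_remaining_query(like_ids, unlike_ids):
    """Match every document except those returned by the two MLT passes."""
    # Merge both id lists and drop duplicates; sorting keeps the body stable.
    excluded = sorted(set(like_ids) | set(unlike_ids))
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                "must_not": {"ids": {"values": excluded}},
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical ids standing in for the hits of the two MLT queries.
    body = build_remaining_query(["10", "42"], ["7", "42"])
    print(json.dumps(body, indent=2))
```

The obvious cost of this design is the extra round trips, plus an "ids" list that grows with the number of MLT hits.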

danielmitterdorfer · January 10, 2017, 12:04pm

Hi @enp,

I'm not sure I get what you mean. Do you want to use script score in a function score query? If yes, then you just replace the random_score bits with script_score and write your custom scoring script. I also wouldn't really bother with this duplication as I think it's pretty straightforward to understand this way.

It should be something along these lines:

GET /location/type/_search
{
   "query": {
      "function_score": {
         "query": {
            "match_all": {}
         },
         "functions": [
            {
               "filter": {
                  "match": {
                     "country": "AT"
                  }
               },
               "random_score": {},
               "weight": 1000
            },
            {
               "filter": {
                  "match": {
                     "country": "DE"
                  }
               },
               "random_score": {},
               "weight": 1000
            },
            {
               "filter": {
                  "match": {
                     "country": "NO"
                  }
               },
               "random_score": {},
               "weight": 0.001
            },
            {
               "filter": {
                  "match_all": {}
               },
               "random_score": {},
               "weight": 1
            }
         ],
         "score_mode": "first",
         "boost_mode": "sum"
      }
   }
}

I adjusted the weights a bit and changed the score_mode to "first" so the first matching function score is used (otherwise the "NO" case would not be possible).
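
If no overlap between the tiers can be tolerated, one possible alternative (not something proposed in this thread) is to give each tier a constant offset and add a single random_score on top, combined with "score_mode": "sum". The sketch below builds such a request body in Python and mimics the scoring locally. The offsets 4 and 2, the helper names, and the multi-term match filters are assumptions made for illustration, and the request has not been run against a real cluster.

```python
import json
import random

# A match query with several terms matches documents containing any of them.
TOP_TIER = "AT DE"
TOP_AND_BOTTOM = "AT DE NO"


def build_tiered_query():
    """Request body: constant tier offset plus one random value in [0, 1)."""
    return {
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "functions": [
                    # Top tier gets a constant offset of 4.
                    {"filter": {"match": {"country": TOP_TIER}}, "weight": 4},
                    # Middle tier (neither top nor bottom) gets an offset of 2.
                    {
                        "filter": {
                            "bool": {
                                "must_not": {"match": {"country": TOP_AND_BOTTOM}}
                            }
                        },
                        "weight": 2,
                    },
                    # Every document gets a random value; "NO" gets only this.
                    {"random_score": {}},
                ],
                "score_mode": "sum",       # offset + random value
                "boost_mode": "replace",   # ignore the match_all score
            }
        }
    }


def simulated_score(country, rng):
    """Local stand-in for the score the request above is meant to produce."""
    if country in ("AT", "DE"):
        offset = 4
    elif country == "NO":
        offset = 0
    else:
        offset = 2
    return offset + rng.random()


if __name__ == "__main__":
    print(json.dumps(build_tiered_query(), indent=2))
```

With these offsets the simulated scores fall into the bands [4, 5), [2, 3) and [0, 1), so the random part can shuffle documents within a tier but cannot move one across a tier boundary.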

Daniel

danielmitterdorfer · January 10, 2017, 12:13pm

Hi,

Hmm, you could run a dis max query and have the "like" MLT query, "unlike" MLT query and the match_all query as subqueries with different boost values?

Daniel

enp · January 10, 2017, 6:48pm

Thank you, will try all your recommendations

system · February 7, 2017, 6:48pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
