Dec 6th, 2023: [EN] "He's making a list 🎶" and now needs to sort it — with Elasticsearch

While you might be picturing Santa Claus when singing this song, European folklore, particularly in the Alpine regions, has the two legendary figures Saint Nicholas and Krampus. Saint Nicholas, symbolizing generosity and kindness, brings gifts and joy to well-behaved children on December 6th — today! In contrast, Krampus, a horned and menacing creature, serves as a warning to misbehaving children the night before. Together, they create a unique and enduring tradition that encourages generosity and good behavior during the festive season.

Of course, they are using Elasticsearch to track the misbehaving and well-behaved children. But how do they prioritize their visits? This article dives into some of the more interesting sorting options with Elasticsearch — for all options see the official documentation. While the example uses a Saint Nicholas and Krampus theme, the concepts apply to many other scenarios.

Example Dataset

ChatGPT can generate the mapping and dataset for us using the prompt:

generate an elasticsearch mapping and bulk query with 10 christmas themed characters (like the grinch) containing the fields: name, date of birth, geolocation, timezone, behavior (can be good, bad, or mixed), and points (an array of mixed +1 or -1 values)

Mapping

PUT /christmas_characters
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "date_of_birth": {
        "type": "date"
      },
      "geolocation": {
        "type": "geo_point"
      },
      "timezone": {
        "type": "keyword"
      },
      "behavior": {
        "type": "keyword"
      },
      "points": {
        "type": "integer"
      }
    }
  }
}

Bulk Query

POST /christmas_characters/_bulk
{"index":{"_index":"christmas_characters"}}
{"name": "The Grinch", "date_of_birth": "1966-12-01", "geolocation": "40.7128,-74.0060", "timezone": "America/New_York", "behavior": "bad", "points": [1, -1, -1, -1, -1]}
{"index":{"_index":"christmas_characters"}}
{"name": "Santa Claus", "date_of_birth": "270-12-25", "geolocation": "90.0,0.0", "timezone": "UTC", "behavior": "good", "points": [1, 1, 1, 1, 1]}
{"index":{"_index":"christmas_characters"}}
{"name": "Rudolph the Red-Nosed Reindeer", "date_of_birth": "1939-05-01", "geolocation": "89.0,-135.0", "timezone": "America/Anchorage", "behavior": "good", "points": [1, 1, 1, 1, 1]}
{"index":{"_index":"christmas_characters"}}
{"name": "Ebenezer Scrooge", "date_of_birth": "1803-12-19", "geolocation": "51.5074,-0.1278", "timezone": "Europe/London", "behavior": "bad", "points": [-1, -1, -1, -1, -1]}
{"index":{"_index":"christmas_characters"}}
{"name": "Frosty the Snowman", "date_of_birth": "1969-12-07", "geolocation": "40.7128,-74.0060", "timezone": "America/New_York", "behavior": "good", "points": [1, 1, 1, 1, 1]}
{"index":{"_index":"christmas_characters"}}
{"name": "Cindy Lou Who", "date_of_birth": "2000-12-20", "geolocation": "35.682839,-83.516650", "timezone": "America/New_York", "behavior": "good", "points": [1, 1, 1, 1, 1]}
{"index":{"_index":"christmas_characters"}}
{"name": "Jack Skellington", "date_of_birth": "1993-10-13", "geolocation": "34.0522,-118.2437", "timezone": "America/Los_Angeles", "behavior": "mixed", "points": [1, -1, 1, -1, 1]}
{"index":{"_index":"christmas_characters"}}
{"name": "Buddy the Elf", "date_of_birth": "2003-12-10", "geolocation": "40.7128,-74.0060", "timezone": "America/New_York", "behavior": "good", "points": [1, 1, 1, 1, 1]}
{"index":{"_index":"christmas_characters"}}
{"name": "Scrooge McDuck", "date_of_birth": "1947-12-09", "geolocation": "28.6139,77.2090", "timezone": "Asia/Kolkata", "behavior": "mixed", "points": [1, -1, 1, -1, 1]}
{"index":{"_index":"christmas_characters"}}
{"name": "The Nutcracker", "date_of_birth": "1816-12-06", "geolocation": "59.9343,30.3351", "timezone": "Europe/Moscow", "behavior": "good", "points": [1, 1, 1, 1, 1]}

You only need to make one minor correction: Santa Claus's date of birth needs a leading 0 ("date_of_birth": "0270-12-25") to avoid the error failed to parse date field [270-12-25] with format [strict_date_optional_time||epoch_millis] when indexing the data.

Sort the Visits

Let's see how we can sort the visits of Saint Nicholas and Krampus to find out if you deserve a present or a piece of coal — the traditional gift of Krampus to misbehaving kids.

By Age

Maybe the younger you are, the less patience you have to wait. Or the earlier you need to go to bed. So let's match all documents with a match_all and sort by the date_of_birth field in descending order.

GET /christmas_characters/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "date_of_birth": {
        "order": "desc"
      }
    }
  ]
}

The response has the results (here, only a subset for better readability) in the expected order — youngest to oldest person in the dataset. The sort attribute contains the value(s) that were sorted on — in this case the epoch timestamp in milliseconds:

      {
          "name": "Buddy the Elf",
          "date_of_birth": "2003-12-10"
        },
        "sort": [
          1071014400000
        ]
      },
      {
          "name": "Cindy Lou Who",
          "date_of_birth": "2000-12-20"
        },
        "sort": [
          977270400000
        ]
      },
      ...

By Points and Age

Maybe you want to start with those who have behaved best, represented by the array of points with good (1) and bad (-1) behavior. Here, we can sort by the sum of the values of the array, and if multiple sums have an equal value, add a secondary sort condition based on the age again.

GET /christmas_characters/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "points": {
        "order": "desc",
        "mode": "sum"
      },
      "date_of_birth": {
        "order": "desc"
      }
    }
  ]
}

Now the results are sorted by the sum of points and then the age.

      {
          "name": "Buddy the Elf",
          "date_of_birth": "2003-12-10",
          "points": [
            1,
            1,
            1,
            1,
            1
          ]
        },
        "sort": [
          5,
          1071014400000
        ]
      },
      {
          "name": "Cindy Lou Who",
          "date_of_birth": "2000-12-20",
          "points": [
            1,
            1,
            1,
            1,
            1
          ]
        },
        "sort": [
          5,
          977270400000
        ]
      },
      ...

By Proximity

For practical reasons, it might be easiest to sort by proximity. Saint Nicholas is allegedly living at the North Pole — 90 degrees North latitude and 0 degrees East longitude as a symbolic geopoint for his "home":

GET /christmas_characters/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance" : {
          "geolocation" : [0, 90],
          "order" : "asc",
          "unit" : "km",
          "distance_type" : "arc"
      }
    }
  ]
}

Note the order of longitude and latitude in geolocation (I will always get this wrong on the first try), and then we want to order by the closest person (so ascending distance) in kilometers based on the more exact but slower arc (rather than plane) distance.

      {
          "name": "Santa Claus",
          "geolocation": "90.0,0.0",
        },
        "sort": [
          0
        ]
      },
      {
          "name": "Rudolph the Red-Nosed Reindeer",
          "geolocation": "89.0,-135.0",
        },
        "sort": [
          111.19508242688518
        ]
      },
      {
          "name": "The Nutcracker",
          "geolocation": "59.9343,30.3351",
        },
        "sort": [
          3343.157910029211
        ]
      },
      {
          "name": "Ebenezer Scrooge",
          "geolocation": "51.5074,-0.1278",
        },
        "sort": [
          4280.187726752484
        ]
      },
      {
          "name": "The Grinch",
          "geolocation": "40.7128,-74.0060",
        },
        "sort": [
          5480.494135659516
        ]
      },
      {
          "name": "Frosty the Snowman",
          "geolocation": "40.7128,-74.0060",
        },

Through Scripting

For maximum flexibility, Elasticsearch's scripting language, Painless, gives you all the options you want. For example, if you sort by the attributes good, mixed, and then bad (in this order), no existing field would let you do that. But with a script, you can assign a numeric value (at query-time) to each attribute and then sort based on that. And add the age tiebreaker condition again.

GET /christmas_characters/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "lang": "painless",
          "source": """
            if(doc['behavior'].value == 'good'){
              return 1;
            } else if(doc['behavior'].value == 'mixed'){
              return 2;
            } else {
              return 3;
            }
          """
        },
        "order": "asc"
      }
    },
    {
      "date_of_birth": {
        "order": "desc"
      }
    }
  ]
}

Only do this if you have to though — sorting by scripts is slower than on indexed fields, and using Painless often gives the opposite feeling of its name. If you want to sort like this frequently, add the field explicitly at ingestion time.

      {
          "name": "Buddy the Elf",
          "date_of_birth": "2003-12-10",
          "behavior": "good"
        },
        "sort": [
          1,
          1071014400000
        ]
      },
      {
          "name": "Cindy Lou Who",
          "date_of_birth": "2000-12-20",
          "behavior": "good"
        },
        "sort": [
          1,
          977270400000
        ]
      },
      {
          "name": "Frosty the Snowman",
          "date_of_birth": "1969-12-07",
          "behavior": "good"
        },
        "sort": [
          1,
          -2160000000
        ]
      },
      ...

On a Runtime Field

You can do the same as in the previous example on a (query-time) Runtime field using Painless again — though this example sorts by timezone so that everyone can get a visit in their evening. This code snippet also introduces the concept of missing, commonly with the magic values _first or _last, but it can also be a static value like in this example.

GET /christmas_characters/_search
{
  "query": {
    "match_all": {}
  },
  "runtime_mappings": {
    "numeric_timezone": {
      "type": "double",
      "script": {
        "source": """
          if(doc['timezone'].value == 'America/New_York'){
            emit(-5);
          } else if(doc['timezone'].value == 'UTC' || doc['timezone'].value == 'Europe/London'){
            emit(0);
          } else if(doc['timezone'].value == 'Asia/Kolkata'){
            emit(5.5)
          }
        """
      }
    }
  },
  "sort": [
    {
      "numeric_timezone": {
        "order": "desc",
        "missing": -0.1
      }
    },
    {
      "date_of_birth": {
        "order": "desc"
      }
    }
  ]
}

The mapping is a bit cumbersome with lots of values but the general approach is again a good escape hatch if your data isn't indexed the way you need it during query-time.

      {
          "name": "Scrooge McDuck",
          "date_of_birth": "1947-12-09",
          "timezone": "Asia/Kolkata"
        },
        "sort": [
          5.5,
          -696297600000
        ]
      },
      {
          "name": "Ebenezer Scrooge",
          "date_of_birth": "1803-12-19",
          "timezone": "Europe/London"
        },
        "sort": [
          0,
          -5239641600000
        ]
      },
      {
          "name": "Santa Claus",
          "date_of_birth": "0270-12-25",
          "timezone": "UTC"
        },
        "sort": [
          0,
          -53615865600000
        ]
      },
      {
          "name": "Jack Skellington",
          "geolocation": "34.0522,-118.2437"
        },
        "sort": [
          -0.1,
          750470400000
        ]
      },
      ...

Introducing ES|QL

Before wrapping up, there is a new query language in Elasticsearch: the Elasticsearch Query Language (ES|QL). It has a new endpoint (_query), a new and hopefully more compact syntax to write queries, and different output options.

Writing a similar query to the first Painless example would look like this — using a CASE statement within an EVAL. Without going into too many details here, this is a procedural language that passes the results into the next statement.

POST _query?format=txt
{
  "query": """
    FROM christmas_characters
    | EVAL numeric_behavior = CASE(
        behavior == "good", 1,
        behavior == "mixed", 2,
        3
      )
    | SORT numeric_behavior ASC, date_of_birth DESC
    | KEEP name, behavior, numeric_behavior, date_of_birth
    | LIMIT 10
  """
}

And the (configurable) output format can be much more concise than pretty printed JSON.

             name             |   behavior    |numeric_behavior|     date_of_birth      
------------------------------+---------------+----------------+------------------------
Buddy the Elf                 |good           |1               |2003-12-10T00:00:00.000Z
Cindy Lou Who                 |good           |1               |2000-12-20T00:00:00.000Z
Frosty the Snowman            |good           |1               |1969-12-07T00:00:00.000Z
Rudolph the Red-Nosed Reindeer|good           |1               |1939-05-01T00:00:00.000Z
The Nutcracker                |good           |1               |1816-12-06T00:00:00.000Z
Santa Claus                   |good           |1               |0270-12-25T00:00:00.000Z
Jack Skellington              |mixed          |2               |1993-10-13T00:00:00.000Z
Scrooge McDuck                |mixed          |2               |1947-12-09T00:00:00.000Z
The Grinch                    |bad            |3               |1966-12-01T00:00:00.000Z
Ebenezer Scrooge              |bad            |3               |1803-12-19T00:00:00.000Z

And this is what the second Painless query would look like in ES|QL — this one is a little trickier since it requires casting TO_DOUBLE() and turns out a bit longer. However, it should still be more understandable than writing this in Painless.

POST _query?format=txt
{
  "query": """
    FROM christmas_characters
    | EVAL numeric_timezone = CASE(
        timezone == "America/New_York", TO_DOUBLE(-5.0),
        timezone == "UTC", TO_DOUBLE(0.0),
        timezone == "Europe/London", TO_DOUBLE(0.0),
        timezone == "Asia/Kolkata", TO_DOUBLE(5.5),
        TO_DOUBLE(-0.1)
      )
    | SORT numeric_timezone DESC, date_of_birth DESC
    | KEEP name, behavior, numeric_timezone, timezone
    | LIMIT 10
  """
}

Again, following the timezones around the world.

             name             |   behavior    |numeric_timezone|     timezone      
------------------------------+---------------+----------------+-------------------
Scrooge McDuck                |mixed          |5.5             |Asia/Kolkata       
Ebenezer Scrooge              |bad            |0.0             |Europe/London      
Santa Claus                   |good           |0.0             |UTC                
Jack Skellington              |mixed          |-0.1            |America/Los_Angeles
Rudolph the Red-Nosed Reindeer|good           |-0.1            |America/Anchorage  
The Nutcracker                |good           |-0.1            |Europe/Moscow      
Buddy the Elf                 |good           |-5.0            |America/New_York   
Cindy Lou Who                 |good           |-5.0            |America/New_York   
Frosty the Snowman            |good           |-5.0            |America/New_York   
The Grinch                    |bad            |-5.0            |America/New_York

Conclusion

Now that all the sorting is done, off they go to deliver their presents.

5 Likes

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.