Index sorting on low cardinality fields

Hey,

I have been playing around with index sorting and cannot make sense of my results. Basically I have a keyword field named is_existing that is either true or false as possible values, or may not be set at all. Basically every search has a is_existing: true filter.

My assumption here was that by using index sorting on this field, I would be able to speed up all those searches, because they are touching less than 40% of the data, meaning we could basically skip 60% of the data with every search.

What I am seeing however after reindexing the data with index sorting enabled is, that searches are actually slower than without index sorting (i.e. a 70ms search now takes 85ms).

I have already played around with the sort order and putting missing fields first/last and there does not seem to be a difference. This is on Elasticsearch 8.18.

What am I missing here that could cause this?

Thanks for any pointers.

–Alex

@spinscale that’s interesting. Can you share an example of the mapping and query that you are seeing this behavior on. I can probably take a look at it and give you some suggestions or dig into the code a bit at the very least to better explain what’s going on.

Hey John,

thanks for replying!

Unfortunately I cannot share a minimal example, because this runs against a big live production dataset and the query itself is pretty complex.

Fast variant:

No index sorting. is_existing is of type keyword and contains only true or false or nothing as possible values.

Slower variant: Same mapping as above, but index sorting is configured like this:

index": {
      "sort.field": "is_existing",
      "sort.order": "desc",
      "sort.missing" : "_last"
}

I also tested when changing the sort.order to asc with no difference. My base assumption here was that true should be stored before false and before missing, so that Lucene only has to scan the first part of a segment.

Hope that helps!

–Alex

In such a scenario I might consider putting the data into different indices depending on value of that field. If, as you state, the queries are almost always for a specific value.

I think your assumptions are good about how this would work.

Docs would agree as well: Index sorting settings | Reference

What I was hoping to get out of an example was to validate a couple of things. Sorting will only show a performance improvement when both the index and the query are sorted. It’s not a general purpose filtering mechanism.

So a modified example from the docs is something like this:

PUT events
{
  "settings": {
    "index": {
      "sort.field": "is_existing",
      "sort.order": "desc"
    }
  },
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "keyword"
      }
    }
  }
}

AND

GET /events/_search
{
  "size": 10,
  "sort": [
    { "is_existing": "desc" }
  ],
  "track_total_hits": false
}

Are you setting up the search request appropriately? Note that track_total_hits set to false is necessary so we aren’t looking to count everything that matches outside of the first 10 in this example.

Also just because it pings in my head. Have you considered using a boolean field there instead of a keyword it wouldn’t surprise me if that’s much more efficient (but to be fair that’s a completely separate optimization).