I have been playing around with index sorting and cannot make sense of my results. I have a keyword field named is_existing whose value is either true or false, or which may not be set at all. Nearly every search has an is_existing: true filter.
My assumption was that index sorting on this field would speed up all those searches: they touch less than 40% of the data, so every search should be able to skip the remaining 60% or more.
However, after reindexing the data with index sorting enabled, searches are actually slower than without it (e.g. a 70ms search now takes 85ms).
I have already experimented with the sort order and with putting missing values first or last, and neither seems to make a difference. This is on Elasticsearch 8.18.
@spinscale that’s interesting. Can you share an example of the mapping and query that you are seeing this behavior on? I can take a look and give you some suggestions, or at the very least dig into the code a bit to better explain what’s going on.
I also tested changing sort.order to asc, with no difference. My base assumption was that true would be stored before false and before missing, so that Lucene only has to scan the first part of a segment.
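For readers following along, here is a minimal sketch of the kind of create-index body being described, written as a Python dict so the variants are easy to compare. The helper name `index_sort_body` is hypothetical and the mapping is reduced to the one field; the `index.sort.*` settings themselves are the standard index sorting settings. It is an illustration under those assumptions, not the exact configuration from this thread.

```python
import json

def index_sort_body(order="desc", missing="_last"):
    """Build a create-index body that sorts each segment by is_existing.

    Hypothetical helper: only the field name comes from the thread.
    """
    return {
        "settings": {
            "index": {
                "sort.field": "is_existing",
                # For a keyword field, "true" sorts after "false", so
                # "desc" places the "true" documents first.
                "sort.order": order,
                # Where documents without the field go: "_last" or "_first".
                "sort.missing": missing,
            }
        },
        "mappings": {
            "properties": {
                "is_existing": {"type": "keyword"}
            }
        },
    }

body = index_sort_body()
print(json.dumps(body, indent=2))
```

The body would be sent when the index is created (`PUT /<index>`); index sort settings cannot be changed on an existing index, which is why a reindex is needed to try a different order.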
In such a scenario I might consider putting the data into different indices depending on the value of that field, if, as you state, the queries are almost always for a specific value.
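A rough sketch of that idea, assuming documents are routed at index time by a small helper. The index names and the `target_index` function are made up for illustration; the point is only that searches for is_existing: true would then hit one smaller index.

```python
def target_index(doc, prefix="items"):
    """Pick a destination index from the is_existing value.

    Hypothetical naming scheme: "<prefix>-existing" holds the documents
    most searches care about, "<prefix>-other" holds false and missing.
    """
    if doc.get("is_existing") == "true":
        return f"{prefix}-existing"
    return f"{prefix}-other"

docs = [
    {"id": 1, "is_existing": "true"},
    {"id": 2, "is_existing": "false"},
    {"id": 3},  # field not set
]
routed = [target_index(d) for d in docs]
print(routed)
```

The trade-off is that a document whose value changes has to be moved between indices, so this fits best when the field rarely changes.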
What I was hoping to get out of an example was to validate a couple of things. Sorting will only show a performance improvement when both the index and the query are sorted. It’s not a general purpose filtering mechanism.
So a modified example from the docs is something like this:
Are you setting up the search request appropriately? Note that setting track_total_hits to false is necessary so that Elasticsearch does not keep counting every matching document beyond the first 10 in this example.
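A minimal sketch of a search request along those lines, assuming the index is sorted by is_existing in descending order and the value is stored as the keyword "true". The helper name `sorted_search_body` is hypothetical; `size`, `sort` and `track_total_hits` are standard search request parameters.

```python
import json

def sorted_search_body(size=10):
    """Build a search body whose sort matches the assumed index sort."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [{"term": {"is_existing": "true"}}]
            }
        },
        # Assumption: the index was created with sort.field=is_existing
        # and sort.order=desc, so the search sort mirrors it.
        "sort": [{"is_existing": "desc"}],
        # Skip exact hit counting so collection can stop once `size`
        # hits have been gathered per segment.
        "track_total_hits": False,
    }

search = sorted_search_body()
print(json.dumps(search, indent=2))
```

This would be sent to `GET /<index>/_search`. If the search sort differs from the index sort, or total hit tracking stays on, every matching document still has to be visited.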
Also, just because it pings in my head: have you considered using a boolean field there instead of a keyword? It wouldn’t surprise me if that’s much more efficient (but to be fair, that’s a completely separate optimization).
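As a sketch of that suggestion, the mapping and filter would change roughly as follows. Both helper names are hypothetical, and whether this is actually faster for this data set is untested here.

```python
def boolean_mapping():
    """Mapping fragment with is_existing as a boolean instead of a keyword."""
    return {
        "mappings": {
            "properties": {
                "is_existing": {"type": "boolean"}
            }
        }
    }

def existing_filter():
    """Term filter using a real boolean rather than the string "true"."""
    return {"bool": {"filter": [{"term": {"is_existing": True}}]}}

mapping = boolean_mapping()
query = existing_filter()
print(mapping, query)
```

Changing the type of an existing field requires creating a new index and reindexing into it.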