Hi all
Back in 2017/2018 I attended Elasticon in San Francisco and got to meet a bunch of the engineers there. It was a really good experience and I got to have a lot of questions answered.
One in particular was bugging me: why didn't Elasticsearch take better advantage of the "early termination" approach to query execution that relational databases use to deliver results filtered and ordered on indexed fields in roughly logarithmic time (to find the first hit) plus linear time (to find size hits)? In relational databases, this approach covers queries like select * from foo where a > ? order by a limit 10
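To make the cost model concrete, here is a minimal Python sketch of that execution strategy. It assumes the index on a can be modelled as a sorted list; the function name top_k_above and the sample data are made up for illustration, and a real database would walk a B-tree rather than slice a list.

```python
import bisect


def top_k_above(sorted_values, threshold, size):
    """Return up to `size` values strictly greater than `threshold`, in ascending order.

    `sorted_values` stands in for an ascending index on column a.
    """
    # Binary search for the first entry greater than the threshold: O(log n).
    start = bisect.bisect_right(sorted_values, threshold)
    # Read the next `size` entries in index order and stop early: O(size).
    return sorted_values[start:start + size]


index_on_a = [1, 3, 3, 5, 8, 13, 21]
print(top_k_above(index_on_a, 3, 3))  # [5, 8, 13]
```

The work done depends on size, not on how many rows match the filter, which is the property I was hoping to get from Elasticsearch.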
Even with this straightforward way of executing a query, relational databases would not have been enough for my application: I could not index all of my sorting criteria, because I wanted to sort on an arbitrary linear combination of multiple fields chosen from among tens of thousands of fields.
(Edit: "this can be done" should be "something similar to this can be done.")
In theory this can be done with any linear combination of any monotonic functions of any number of fields that have been indexed individually ahead of time. The fraction of documents or rows that must be visited to obtain your results depends on the multivariate distribution of your data in the space defined by your query. In relational terms, if you wanted select * from t where a+b > x order by a+b then ideal performance is achieved when a and b are totally correlated, and worst-case performance when they are totally anti-correlated. More filter criteria can be included, but they will eat into the performance benefits of the particular approach that I had in mind.
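One well-known technique of this kind is Fagin's threshold algorithm, and the sketch below uses it to show how correlation drives the number of rows visited. This is an illustration under assumptions, not necessarily the exact approach I had in mind: it assumes non-negative weights, results ordered by the largest combined score first, and per-field indexes modelled as sorted lists. The name threshold_top_k and the synthetic data are hypothetical.

```python
import heapq


def threshold_top_k(rows, weights, k):
    """Top-k rows by weighted sum of fields, using per-field sorted access.

    rows:    dict mapping row id -> tuple of field values
    weights: tuple of non-negative weights, one per field
    Returns (list of (score, row_id) in descending order, number of rows visited).
    """
    n_fields = len(weights)
    # One descending list of row ids per field; stands in for one index per field.
    indexes = [
        sorted(rows, key=lambda r, f=f: rows[r][f], reverse=True)
        for f in range(n_fields)
    ]

    def score(row_id):
        return sum(w * v for w, v in zip(weights, rows[row_id]))

    seen = set()
    heap = []  # min-heap holding the best k (score, row_id) pairs found so far
    for depth in range(len(rows)):
        for f in range(n_fields):
            row_id = indexes[f][depth]
            if row_id not in seen:
                seen.add(row_id)
                item = (score(row_id), row_id)
                if len(heap) < k:
                    heapq.heappush(heap, item)
                elif item > heap[0]:
                    heapq.heapreplace(heap, item)
        # No unseen row can score higher than the values at the current depth combined.
        threshold = sum(w * rows[indexes[f][depth]][f] for f, w in enumerate(weights))
        if len(heap) == k and heap[0][0] >= threshold:
            break  # early termination: the current top k cannot be beaten
    return sorted(heap, reverse=True), len(seen)


correlated = {i: (i, i) for i in range(1000)}
anti_correlated = {i: (i, 999 - i) for i in range(1000)}
print(threshold_top_k(correlated, (1, 1), 10)[1])       # 10 rows visited
print(threshold_top_k(anti_correlated, (1, 1), 10)[1])  # 1000 rows visited
```

With perfectly correlated fields the scan stops after visiting only the 10 rows it returns; with perfectly anti-correlated fields it ends up visiting every row.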
I made my case to some of the engineers there for something like this in Elasticsearch, and some time later Elasticsearch introduced new vector types, which seem to have been replaced now by rank_features. Eventually, a linear mode became available.
I've been very excited for some time to take advantage of these new features.
Imagine my surprise, then, when I found that I could not apply common range queries to data indexed into rank_features fields.
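For concreteness, this is roughly the shape of what I am trying to do, written as Python dicts that would be serialized into request bodies. The field name features and the feature names a and b are placeholders, and the helper linear_combination_query is hypothetical. I am assuming that summing rank_feature clauses inside a bool query, each with the linear function and a boost, yields a weighted sum of the feature values.

```python
import json

# Mapping with a single rank_features field in place of thousands of numeric fields.
# "features" is a placeholder field name.
mapping = {
    "mappings": {
        "properties": {
            "features": {"type": "rank_features"}
        }
    }
}


def linear_combination_query(weights, size=10):
    """Build a search body scoring documents by a weighted sum of features.

    weights: dict mapping feature name -> positive weight (used as the clause boost).
    """
    should = [
        {"rank_feature": {"field": f"features.{name}", "linear": {}, "boost": weight}}
        for name, weight in weights.items()
    ]
    return {"size": size, "query": {"bool": {"should": should}}}


# The kind of filter clause I found I could not apply to a rank_features field.
range_filter = {"range": {"features.a": {"gt": 0.5}}}

print(json.dumps(linear_combination_query({"a": 1.0, "b": 2.0}), indent=2))
```

The first two pieces are what I was excited about; the range clause on the same field is where I hit the barrier.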
I am more than a little heartbroken this Valentine's Day to come up against this barrier.
Is there a way to take advantage of rank_features so that I can reduce the size of my mapping by tens of thousands of fields and reduce my query latency when ordering on linear combinations of these fields, while retaining the ability to filter on these fields with range queries?
Thanks and Happy Valentine's Day!