Filter on rank_features

Hi all

Back in 2017/2018 I attended Elasticon in San Francisco and got to meet a bunch of the engineers there. It was a really good experience and I got to have a lot of questions answered.

One in particular was bugging me: why doesn't Elasticsearch take better advantage of the "early termination" approach to query execution that relational databases use to return results filtered and ordered on an indexed field in roughly logarithmic time (to find the first hit) plus linear time (to collect the requested number of hits)? This approach covers queries like select * from foo where a > ? order by a limit 10
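To make the access pattern concrete, here is a minimal Python sketch of it. It assumes the index on a is simply a sorted list held in memory, and the function name first_n_greater_than is made up for illustration; a real database would walk a B-tree instead.

```python
import bisect

def first_n_greater_than(sorted_values, x, n=10):
    """Mimic: select a from foo where a > x order by a limit n.

    sorted_values stands in for an index on column a (ascending order).
    """
    # Binary search for the first entry strictly greater than x: O(log N).
    start = bisect.bisect_right(sorted_values, x)
    # Read at most n consecutive entries and stop: O(n), independent of N.
    return sorted_values[start:start + n]

index_on_a = [1, 3, 3, 5, 8, 13, 21]
print(first_n_greater_than(index_on_a, 3, n=2))
```

The point is that the work depends on the page size n, not on how many rows match the filter.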

Even with this straightforward execution strategy, a relational database would not have been enough for my application. I could not index all of my sorting criteria, because I wanted to sort on an arbitrary linear combination of multiple fields chosen from among tens of thousands of fields.

(Edit: "this can be done" should be "something similar to this can be done.")

In theory this can be done with any linear combination of any monotonic functions of any number of fields, as long as each field has been indexed individually ahead of time. The fraction of documents or rows that must be visited to obtain the results depends on the multivariate distribution of the data in the space defined by the query. In relational terms, if you wanted select * from t where a+b > x order by a+b, then the best performance is achieved when a and b are perfectly correlated, and the worst when they are perfectly anti-correlated. More filter criteria can be included, but they eat into the performance benefit of the particular approach I had in mind.
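One way to picture this is a threshold-style top-k scan over two per-field sorted lists. The Python sketch below is only an illustration under my own assumptions, not necessarily the exact approach described above: it returns the k largest values of a+b (descending order only, no extra filter), the function name top_k_by_sum is hypothetical, and sorted in-memory lists stand in for the per-field indexes.

```python
import heapq

def top_k_by_sum(rows, k):
    """rows maps row id -> (a, b). Returns (top-k [(a+b, id)], rows visited)."""
    # Stand-ins for the two single-field indexes, best value first.
    by_a = sorted(rows, key=lambda i: rows[i][0], reverse=True)
    by_b = sorted(rows, key=lambda i: rows[i][1], reverse=True)
    seen = set()
    heap = []  # min-heap holding the k best (score, id) pairs seen so far
    for id_a, id_b in zip(by_a, by_b):
        for i in (id_a, id_b):
            if i in seen:
                continue
            seen.add(i)
            score = rows[i][0] + rows[i][1]
            if len(heap) < k:
                heapq.heappush(heap, (score, i))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, i))
        # No unseen row can beat this bound: its a and b are both at or
        # below the current positions in the two sorted lists.
        threshold = rows[id_a][0] + rows[id_b][1]
        if len(heap) == k and heap[0][0] >= threshold:
            break  # early termination
    return sorted(heap, reverse=True), len(seen)

correlated = {i: (i, i) for i in range(1000)}
anti_correlated = {i: (i, 999 - i) for i in range(1000)}
print(top_k_by_sum(correlated, 10)[1])       # rows visited, correlated data
print(top_k_by_sum(anti_correlated, 10)[1])  # rows visited, anti-correlated data
```

With perfectly correlated data the scan stops after visiting only the k best rows; with perfectly anti-correlated data it ends up visiting every row, which matches the best and worst cases described above.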

I made my case to some of the engineers there for something like this in Elasticsearch, and some time later Elasticsearch introduced new vector types, which seem to have been replaced now by rank_features. Eventually, a linear mode became available.

I've been very excited for some time to take advantage of these new features.

Imagine my surprise, then, when I found that I could not apply common range queries to data indexed into rank_features fields.

I am more than a little 🫀💣 heartbroken this Valentine's Day to come up against this barrier.

Is there a way to take advantage of rank_features so that I can reduce the size of my mapping by tens of thousands of fields and reduce my query latency when ordering on linear combinations of these fields, while retaining the ability to filter on these fields with range queries?

Thanks and Happy Valentine's day!

> I could not apply common range queries to data indexed into rank_features fields

Sorry to hear about your heartbreaking experience with rank_features fields.

Indeed, there is currently no way to run range queries on rank_features. rank_features are not numeric fields; they are encoded as terms with frequencies. A range query on them would be slow, because it would require an exhaustive iteration over all documents that have the field. That's why we opted not to implement it. That said, it may be worth submitting a GitHub feature request for this.

Another thing is that rank_features were not designed as a substitute for multiple numeric fields. They were designed for the search ranking case, to enhance textual relevance with numeric ranking features. That's why filtering by range doesn't really suit the design of this field type.
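For illustration, the intended ranking use looks roughly like the request bodies below, written as Python dicts. The field names (body, topics, topics.sports, topics.politics) and the boost values are hypothetical; the linear function of the rank_feature query is the "linear mode" mentioned in the question. Treat this as a sketch rather than a recommended setup.

```python
# Mapping body: one rank_features field can hold many named numeric features.
mapping = {
    "mappings": {
        "properties": {
            "body": {"type": "text"},
            "topics": {"type": "rank_features"},
        }
    }
}

# Search body: the text match decides which documents are hits; each
# rank_feature clause in "should" only adds to the score. With the linear
# function, each clause contributes boost * feature value, so the extra
# score is a weighted sum of the features.
search = {
    "query": {
        "bool": {
            "must": [{"match": {"body": "election results"}}],
            "should": [
                {"rank_feature": {"field": "topics.politics", "linear": {}, "boost": 2.0}},
                {"rank_feature": {"field": "topics.sports", "linear": {}, "boost": 0.5}},
            ],
        }
    }
}
```

Note that nothing here filters on a feature value: the rank_feature clauses only influence ordering, which is the limitation discussed in this thread.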

For reducing the size of the mapping, we have an idea for a flattened numeric field, but it is not actively being worked on.
