I have a query that is actually twice as fast when tracking total hits (setting track_total_hits: true or adding an aggregation) than when running it with track_total_hits: false, which seems to contradict the idea of block-max WAND to me.
With track_total_hits: false this query runs in about 400ms, whereas with track_total_hits: true or with an added aggregation it runs in less than 200ms.
I cannot share the dataset or the full query. The query is basically a bool query consisting of a single must clause with several should clauses and three filters.
The total number of results matching the query is about 11k in an index with double-digit millions of documents.
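For readers who want to reproduce the comparison, here is a minimal sketch of the query shape described above, built as a Python dict. Since the real query cannot be shared, every field name and value here is a hypothetical placeholder, and the helper `build_request` is an illustration only. The sole difference between the two bodies is the `track_total_hits` flag.

```python
def build_request(track_total_hits: bool) -> dict:
    """Build a search body with one must clause, several should clauses
    and three filters. All field names and values are made-up placeholders."""
    return {
        "track_total_hits": track_total_hits,
        "query": {
            "bool": {
                # The single scoring clause that every hit has to match.
                "must": [{"match": {"title": "user query text"}}],
                # Optional clauses that only influence the score.
                "should": [
                    {"term": {"category": "featured"}},
                    {"term": {"brand": "acme"}},
                    {"range": {"published": {"gte": "now-30d"}}},
                ],
                # Non-scoring restrictions.
                "filter": [
                    {"term": {"status": "active"}},
                    {"term": {"language": "en"}},
                    {"range": {"price": {"gte": 0}}},
                ],
            }
        },
    }


with_counting = build_request(True)
without_counting = build_request(False)
```

Sending both bodies to the same index and comparing the `took` value of each response would show whether the effect described here is present.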
This is the profiler output with track_total_hits: false
Yes, exactly the same index and the same warmup. I can switch between track_total_hits: true and track_total_hits: false arbitrarily and the response times stay consistent, so it is not a warmup issue.
Unfortunately, queries that have a mix of MUST and SHOULD clauses are not great at dynamically pruning hits today. The best that we do at the moment is to mark (the disjunction of) SHOULD clauses as required when the minimum competitive score is greater than the maximum score of the MUST clauses. Lucene focuses more on queries that have only SHOULD clauses or only MUST clauses (plus FILTER/MUST_NOT clauses possibly, which are easier to handle since they don't contribute scores).
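The rule described above can be sketched in a few lines. This is only an illustration of the stated condition, not Lucene's actual code: the function names are hypothetical, and scores are assumed to be plain floats where a hit's score is the sum of its MUST and SHOULD contributions.

```python
def should_clauses_become_required(min_competitive_score: float,
                                   max_must_score: float) -> bool:
    """True once the MUST clauses alone can no longer produce a competitive hit."""
    return min_competitive_score > max_must_score


def can_skip_doc(matches_any_should: bool,
                 min_competitive_score: float,
                 max_must_score: float) -> bool:
    """A doc matching only the MUST clauses can be skipped once the
    disjunction of SHOULD clauses has effectively become required."""
    required = should_clauses_become_required(min_competitive_score, max_must_score)
    return required and not matches_any_should
```

Until the minimum competitive score rises above the best possible MUST score, nothing can be skipped under this rule, which is consistent with the limited pruning described for such queries.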
Interestingly, this kind of query is not covered at all by literature on query evaluation, though it feels like we could generalize the way that dynamic pruning works on disjunctions to such queries. I'm curious to better understand why you end up with a mix of MUST and SHOULD clauses. Presumably, you have some user query that you put in a MUST clause, and then you add some features in SHOULD clauses to allow them to influence scoring? Is that a fair assumption?
Your assumption is 100% correct. We have one part that is a must clause, and all the should clauses (around half a dozen, soon to be more probably) influence the scoring in different ways: term queries, range queries, and function scores with field value factors. I had considered replacing the latter with rank feature, but given this issue I am not sure how much sense that makes. Do you think that would make a difference?
I had not considered this kind of query to be special. The way you describe it, it's not my query in particular but the combination of must/should. Is there anything I could optimize on the query to improve the behaviour or give Lucene/ES any hints about it?
Rereading your post you wrote that the query is not great at dynamically pruning hits. However it's not only not great, but actually worse than the tracking of total hits, and that's what surprises me most. I'd be totally fine with equal performance in this case.
You are correct, it wouldn't help in the current state.
Is there anything I could optimize on the query to improve the behaviour or give Lucene/ES any hints about it?
Not much... The only idea that comes to mind is trying to move some SHOULD clauses to MUST clauses if they match all documents... which may not apply to you.
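As a rough sketch of that suggestion, the helper below moves selected SHOULD clauses of a bool query (held as a Python dict) into its MUST list. The helper name and the example fields are hypothetical, and it assumes you already know which clauses match every document; promoting a clause that does not match everything would change the result set. It also ignores `minimum_should_match`, which would need adjusting if it is set.

```python
import copy


def promote_should_to_must(bool_query: dict, matches_all: set) -> dict:
    """Return a copy of bool_query where the SHOULD clauses at the given
    indices (assumed to match every document) are moved to MUST."""
    result = copy.deepcopy(bool_query)
    should = result.get("should", [])
    promoted = [c for i, c in enumerate(should) if i in matches_all]
    remaining = [c for i, c in enumerate(should) if i not in matches_all]
    result["must"] = result.get("must", []) + promoted
    if remaining:
        result["should"] = remaining
    else:
        result.pop("should", None)
    return result


# Hypothetical example: the second should clause is known to match all docs.
original = {
    "must": [{"match": {"title": "user query text"}}],
    "should": [
        {"term": {"category": "featured"}},
        {"range": {"popularity": {"gte": 0}}},
    ],
}
rewritten = promote_should_to_must(original, {1})
```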
Rereading your post you wrote that the query is not great at dynamically pruning hits. However it's not only not great, but actually worse than the tracking of total hits, and that's what surprises me most. I'd be totally fine with equal performance in this case.
You are right, it's very disappointing. We've been working hard on fixing it for pure disjunctions (we used to have a similar issue where enabling hit counting could make these queries faster, typically when there were many high-frequency clauses). Typically the problem is that we do more work for dynamic pruning but don't actually end up evaluating fewer docs.
We should look deeper into these queries that mix SHOULD/MUST clauses now. In case you're interested in giving it a try, a starting point would be to make WANDScorer accept required clauses in addition to optional clauses and find a way to make that work.