Understanding Performance of Compound Query

workmanw · October 14, 2018, 3:02am

Hello,

I want to understand more about how Compound Queries perform. To help frame the problem, my index has 100M records. We have nearly 1M users, who on average have access to about 100 records. So only 0.001% of the data is accessible to each user.

I'm using the filter portion of the bool compound query to filter out all the records a user doesn't have access to. So this filter reduces the set of responses by several orders of magnitude. What I'm curious about is in the must/should/must_not portions of the bool query am I able to use queries that would normally be consider nonoptimal (such as wildcard)?

I think it boils down to a simple question, do all parts of the compound query look at the entire indexed field(s), then intersect the results? Or is there some optimization that comes from first filtering down the dataset, then applying less optimal queries?

Mark_Harwood · October 14, 2018, 8:39am

Things like query term expansion do not consider accessibility of the terms chosen as that would add further delay.
Security filters tend to be applied only to the stream of matching docs once a query has been constructed.

workmanw · October 14, 2018, 12:18pm

Would you be able to elaborate more? Or is there a good resource you know of for learning about how compound queries are executed?

Mark_Harwood · October 15, 2018, 8:31am

No one blog I'm aware of but I'm sure they're out there. These docs talk about caching filters.

Essentially each query clause offers an iterator reading from a sorted list of doc Ids on disk. Abool query will advance each iterator's cursor to score whatever it sees as the "current" doc. The bool query knows some clauses are "must" type clauses and so can use their largest current-doc cursor to request a "skip to.." operation on all other clauses - there's no point in iterating over docs one 1 to 1 million when a must clause tells you the next viable match is doc #1,000,00. Skips are efficient because on-disk, long lists of doc-ids for a term are also interspersed with "skip" pointers for efficiently leaping forward in the list.

Another recent performance enhancement in Lucene is that weak "should" clauses can be dropped from execution entirely if it is known that any match on that clause will still not score a doc highly enough to beat the current best matches. This optimisation is yet to feature in elasticsearch as it requires the client to acknowledge that the "total-hits" count returned would be a low-bound rather than an exact figure. (Check out the big gains to be had)

workmanw · October 16, 2018, 1:45pm

Thanks a lot for taking the time to explain that!

system · November 13, 2018, 1:45pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch query performance using filter query Elasticsearch	4	4002	December 29, 2017
Algorithmic complexity of bool queries Elasticsearch	2	453	January 6, 2021
Filter vs bool Query Performance , getting bad performance on using Filters Elasticsearch	7	2645	June 6, 2018
Filter performance difference (Bool vs Terms) Elasticsearch	2	4263	July 5, 2017
Performance difference between searching with one field against with two fields Elasticsearch	6	1301	July 5, 2017

Understanding Performance of Compound Query

Related topics