I want to understand more about how Compound Queries perform. To help frame the problem, my index has 100M records. We have nearly 1M users, who on average have access to about 100 records. So only 0.001% of the data is accessible to each user.
I'm using the filter portion of the bool compound query to filter out all the records a user doesn't have access to. So this filter reduces the set of responses by several orders of magnitude. What I'm curious about is in the must/should/must_not portions of the bool query am I able to use queries that would normally be consider nonoptimal (such as wildcard)?
I think it boils down to a simple question, do all parts of the compound query look at the entire indexed field(s), then intersect the results? Or is there some optimization that comes from first filtering down the dataset, then applying less optimal queries?
Things like query term expansion do not consider accessibility of the terms chosen as that would add further delay.
Security filters tend to be applied only to the stream of matching docs once a query has been constructed.
No one blog I'm aware of but I'm sure they're out there. These docs talk about caching filters.
Essentially each query clause offers an iterator reading from a sorted list of doc Ids on disk. Abool query will advance each iterator's cursor to score whatever it sees as the "current" doc. The bool query knows some clauses are "must" type clauses and so can use their largest current-doc cursor to request a "skip to.." operation on all other clauses - there's no point in iterating over docs one 1 to 1 million when a must clause tells you the next viable match is doc #1,000,00. Skips are efficient because on-disk, long lists of doc-ids for a term are also interspersed with "skip" pointers for efficiently leaping forward in the list.
Another recent performance enhancement in Lucene is that weak "should" clauses can be dropped from execution entirely if it is known that any match on that clause will still not score a doc highly enough to beat the current best matches. This optimisation is yet to feature in elasticsearch as it requires the client to acknowledge that the "total-hits" count returned would be a low-bound rather than an exact figure. (Check out the big gains to be had)
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.