Algorithmic complexity of bool queries

I've been searching all over the net to get an understanding of the algorithmic complexity / rough big-O of bool queries for Elasticsearch, but I have come up dry so I'm asking here. (Note: this is not a question about correct query syntax, I will use psuedo-code that basically matches with the dsl for expediency.)

My use case - I have a query that has essentially a series of ors (shoulds) and ands (musts). To simplify I'll create a toy example:

query name='name' and filter: (x=['<uuid_1>', ...] and y=['<uuid_2>', ...]) or z='[<uuid_3>', ...])
order by name desc

where x, y, and z are keywords and name is text, name is in a sort.field with sort.order. name may be absent from some queries where we just filter.

In my use case x, y, and z all could contain one or more of half-a-billion values i.e. cardinality is 500,000,000. The number of documents is 500,000,000 as well.

We expect in the worst case for each = relation to match ~1 million documents.

So my question is, how can I reason about the performance? If this was just filtering on x and nothing more I'd expect it to be an O(1) fetch of 1 million uuids and it would just return the top 50 docs out of 1 million for 1 page which is pretty straight forward.

But how will a bool query like this operate? 3 x O(1) fetches and then join? How does the join happen? Does it need re-iterate over all the hits in memory and combine based on the bool logic? I'm really not sure.

I appreciate any insight here,
thanks!

Adrien Grand has done a lot of great talks on Lucene internals. Here’s one: https://youtu.be/p51vIDWHWqk

His more recent work on Block max WAND is also of interest.

2 Likes

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.