My understanding is bitsets play more of a role when you are looking over multiple filters. But think in term mapping. It is possible to construct a reverse mapping between maybe value => items that contain that value. I'm not too familiar with ES or Lucene internals but if they use this inverted index setup, that could be one reason why the postive query (which inverted index are great for) can be quicker then negative.
But @teamamerica, as far as I'm aware, a bitset will have 1s and 0s for whether the filter matches. So would it not be a quick operation to not that bitset logically and get a list of documents for which the filter doesn't match?
But @Christian_Dahlqvist, I can select the same 29226733 documents from the must_not query by using all the other possible values for that field in the must query and that consistently takes around 90ms. So the number of documents returned doesn't seem to explain the difference.
must_not is translated into a not filter surrounding a must filter.
And/or/not filters are not translated into bitsets, see
What you see is Elasticsearch traversing the posting lists (all values of the field) to look for those field values that does not match. This takes time proportional to the cardinality of the field. Especially on high cardinalty fields, must_not is a challenge.
So essentially if you want fast must_nots it would probably be quicker to do that noting in application code and query elasticsearch with a must with a list of all possible values excluding the ones in must_nots?
It doesn't look like must_not translates it "must": {"not":{}}. must_not is natively part of Lucene's BoleanQuery. The only thing I see is that Elasticsearch adds a match_all query to the must if the query is only must_not clauses. Presumably because BooleanQuery will find no results if its just must_not clauses. I'd expect match_all to be quick though. How fast as a simple match_all query?
Do you have anything else in your query except the bool?
@nik9000 that's interesting, the match_all thing makes sense in order to get results back in the absence of any other filter. The queries I've posted are the exact ones I'm using. The field is non_analyzed, would you be able to replicate it on your cluster I wonder?
I just tried with no luck - the one where I listed all the values took a bit longer. Both took just about the same amount of time as a match_all. But I'm on 1.3 in production so it might be a new thing.
must_not does indeed not translate to "must": {"not":{}}.
If you run a bool query with must/should and must_not clauses, Lucene will first create an iterator that matches the must/should clauses, and then if you have must_not clauses, this iterator will be wrapped in order to exclude documents that match any of these must_not clauses.
Let me take an example: you have 1M documents in your index, and 1000 of them contain bar in the foo field. If you want to find all documents that match foo:bar, Lucene will just iterate over the postings list of foo:bar and call the collector on it. So you would decode 1000 documents from your postings list and call the collector 1000 times. Now if you execute the same clause as a must_not filter and have a match_all query as a must clause, Lucene will iterate over all documents matching the match_all query, and for each of them check if they match foo:bar and should be excluded. So you have to check 1M times if the document matches foo:bar and call the collector 999000 times.
This is why when you have a boolean field, it is more efficient to encode true and false explicitely instead of only building an index for true and then searching for documents that have false as a value by searching for documents that don't have true.
So essentially if you want fast must_nots it would probably be quicker to do that noting in application code and query elasticsearch with a must with a list of all possible values excluding the ones in must_nots?
This depends on how many possible values you are. If there are only a handful of them then this could help, but if there are thousands of possible values, this would not be an option.
@jpountz, thanks for the explanation that's really really helpful.
We had about 13 unique values for this cs_grp field so having a list of the possible values sped us up a little. As our more permanent solution we have added an extra boolean field in order to handle the same logic. This allows us to must the field value as true or `false. But it does mean we have to do more at index time (and so a reindex for all our current data), but it should speed us up hugely.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.