Quick question about index sorting on boolean fields. Basically, we have an is_active field in our data, and most of our customer-facing queries have an is_active: true filter applied.
Am I right to assume that index sorting will help here, since early termination should kick in when filtering by is_active: true even though the query itself does not sort? By my current understanding this should result in a speed-up, but I would like to be sure.
If so, what would be the right order to configure? As booleans are stored as T and F and I am interested in T for the fast path, I would need desc sorting, so that we hit the true documents first and skip the false documents?
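To make the ordering question concrete, here is a toy Python sketch. It is only a mental model: a plain list stands in for a sorted segment, and the names `docs` and `scan_active` are made up for illustration. It does not show how Lucene or Elasticsearch actually executes a filter.

```python
# Toy model only: a plain list stands in for one sorted segment.
docs = [
    {"id": 1, "is_active": False},
    {"id": 2, "is_active": True},
    {"id": 3, "is_active": True},
    {"id": 4, "is_active": False},
    {"id": 5, "is_active": True},
]

# Descending sort on a boolean puts True (1) ahead of False (0).
sorted_docs = sorted(docs, key=lambda d: d["is_active"], reverse=True)


def scan_active(sorted_docs):
    """Collect active doc ids and stop at the first inactive doc."""
    hits = []
    visited = 0
    for doc in sorted_docs:
        visited += 1
        if not doc["is_active"]:
            break  # every later doc is inactive too, so stop here
        hits.append(doc["id"])
    return hits, visited


hits, visited = scan_active(sorted_docs)
print(hits, visited)
```

With desc ordering the scan touches the three active documents plus one inactive document and then stops, which is the behaviour the question is hoping for.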
My understanding (I haven't performed any performance tests) is that pre-sorting groups the documents into files based on the values of that key.
That's how presorting helps in searching: by ignoring files without the value you are searching for. That should significantly reduce the number of documents the search has to go through.
I think the lower cardinality of the presorting key is preferred.
Depending on how large your index is, I think a boolean might not be fine-grained enough.
A boolean presorting key would at most reduce your pool by half (assuming equal distribution).
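As a rough back-of-the-envelope check on that "half" figure, the share of documents left after the filter is simply the share of active documents. The helper name `remaining_fraction` and the document counts below are made up for illustration.

```python
def remaining_fraction(active_docs, total_docs):
    """Fraction of the index still in play after filtering on is_active: true."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return active_docs / total_docs


# Equal distribution: half of the documents can be skipped.
even_split = remaining_fraction(500, 1000)

# Skewed distribution: if most documents are active, little is skipped.
mostly_active = remaining_fraction(900, 1000)

print(even_split, mostly_active)
```

So the benefit depends on how the values are distributed: the more documents are active, the less the boolean key alone can prune.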
Is there another key you would group by in addition to "is_active"?
I would probably put that key first and "is_active" second in the presorting order.
For example, say your documents are rental cars, and "is_active" is whether the car is in service.
I would presort both fields in this order {["model","is_active"],["desc","desc"]}.
I believe this is preferred because your physical files would group, say, {"model":"GM"} together into file(s).
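For reference, the shorthand above could be written out as an index creation body using the `index.sort.field` and `index.sort.order` settings. This is a minimal sketch: the helper `build_index_body` is hypothetical, and the mapping types (`keyword` for "model", `boolean` for "is_active") are assumptions based on the rental car example. Index sorting has to be set when the index is created.

```python
import json


def build_index_body(sort_fields, sort_orders):
    """Build a create-index body with index sorting (hypothetical helper)."""
    if len(sort_fields) != len(sort_orders):
        raise ValueError("each sort field needs exactly one sort order")
    return {
        "settings": {
            "index": {
                "sort.field": sort_fields,
                "sort.order": sort_orders,
            }
        },
        # Assumed mapping for the rental car example.
        "mappings": {
            "properties": {
                "model": {"type": "keyword"},
                "is_active": {"type": "boolean"},
            }
        },
    }


body = build_index_body(["model", "is_active"], ["desc", "desc"])
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the request body when creating the index.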