When I request documents using the same sort order, matches come back extremely fast. For example, I am searching Reddit submissions with an API I created:
From what I know about binary search, searching on a sorted array is a O(log N) operation. However, sorting in the opposite direction should also be nearly as fast. Is this a bug with Lucene's implementation of their index?
From what I know about binary search, searching on a sorted array is a O(log N) operation. However, sorting in the opposite direction should also be nearly as fast. Is this a bug with Lucene's implementation of their index?
Sorting the result of a search is not a binary search. We collect all documents that match the query in the order of their appearance in the index and keep track of the N best ones with a priority queue. If the index sort matches the search sort we can early terminate the query when N documents are collected, otherwise all documents must be collected. The reverse index sort order is the worst case since it requires to change the top N every time we visit a new document. There was some discussions to optimize this case in Lucene though ([LUCENE-7482] Faster sorted index search for reverse order search - ASF JIRA) with some interesting ideas. The issue is quite old so it could be interesting to open a new one that focuses on reverse index sort.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.