After upgrading to ES 2.1 I've noticed that max_result_window now defaults to 10000 and that a search throws an exception if it is exceeded. I understand the reasons behind this and was wondering if there is any other way to perform the following (without increasing the max_result_window setting):
Perform a search across a set of 100k documents, with the results being paged. This is done using the 'from' and 'size' query settings.
Click on any of the pages, with the results for that page being displayed. As it stands now this will throw an exception if the page exceeds the default result window.
The search needs to be ordered so using scan/scroll seems to be out of the question.
As David says, you can scroll without scan. Other than that, you'd have to
raise the setting. Or you could prevent such deep scrolling in your
application. I've done that in the past.
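A minimal sketch of what "scroll without scan" could look like: a normal sorted search opened with a scroll parameter, then repeated calls to the scroll endpoint. The `send` callable is a hypothetical transport (it takes a method, path and JSON body and returns the decoded response), and the index name, query and sort field are assumptions for illustration only. Note that a scroll walks forward through the results, so it does not let a user jump straight to an arbitrary page.

```python
import json


def first_scroll_request(index, query, sort, size, keep_alive="1m"):
    """Build the initial sorted search that opens a scroll context."""
    path = "/%s/_search?scroll=%s" % (index, keep_alive)
    body = {"query": query, "sort": sort, "size": size}
    return "POST", path, json.dumps(body)


def next_scroll_request(scroll_id, keep_alive="1m"):
    """Build the follow-up request that fetches the next batch."""
    body = {"scroll": keep_alive, "scroll_id": scroll_id}
    return "POST", "/_search/scroll", json.dumps(body)


def scroll_all(send, index, query, sort, size):
    """Yield hits in sort order until a batch comes back empty.

    `send` is a hypothetical transport: send(method, path, body) -> dict.
    """
    response = send(*first_scroll_request(index, query, sort, size))
    while response["hits"]["hits"]:
        for hit in response["hits"]["hits"]:
            yield hit
        # Always use the most recent scroll id returned by the server.
        response = send(*next_scroll_request(response["_scroll_id"]))
```

Because the sort is part of the first request, the batches come back in order, which is the difference from a scan-type scroll.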
I don't know of anything else you could do as it stands now.
I suspect it's technically possible to build an "after" style clause that'd
find you results whose scores are lower than some point. It wouldn't be
100% accurate because the data changes, but it's something.
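As a rough illustration of the "after" idea, here is a sketch that approximates it with a range filter on the sort field rather than on the score. It assumes the results are sorted descending on a single field (the name `created_at` in the usage below is hypothetical) and that the application remembers the sort value of the last hit it displayed. Documents that share the boundary value would be skipped, so this only works cleanly when the sort key is unique.

```python
def after_page_body(query, sort_field, size, last_value=None):
    """Build a search body for the page that follows `last_value`.

    With no `last_value` this is simply the first page.
    """
    body = {"size": size, "sort": [{sort_field: "desc"}]}
    if last_value is None:
        body["query"] = query
    else:
        # Keep only documents that sort strictly after the last one shown.
        body["query"] = {
            "bool": {
                "must": query,
                "filter": {"range": {sort_field: {"lt": last_value}}},
            }
        }
    return body


first = after_page_body({"match_all": {}}, "created_at", 10)
following = after_page_body({"match_all": {}}, "created_at", 10, last_value=1449000000)
```

Each request then uses from 0, so the result window is never exceeded, at the price of only being able to step forward page by page.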
Nevermind, I found the reason.
Now, I am wondering if there is any effective way to paginate a large number of docs.
Would scan & scroll make a good pagination feature?
Deep Paging in Distributed Systems
To understand why deep paging is problematic, let’s imagine that we are searching within a single index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the coordinating node, which then sorts all 50 results in order to select the overall top 10.
Now imagine that we ask for page 1,000—results 10,001 to 10,010. Everything works in the same way except that each shard has to produce its top 10,010 results. The coordinating node then sorts through all 50,050 results and discards 50,040 of them!
You can see that, in a distributed system, the cost of sorting results grows the deeper we page, because every shard must produce from + size results and the coordinating node must sort all of them. There is a good reason that web search engines don’t return more than 1,000 results for any query.
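The arithmetic in the two examples above can be reproduced with a small sketch. The function name is made up for illustration; the numbers follow directly from the text: each shard returns from + size results and the coordinating node keeps only size of them.

```python
def deep_page_cost(shards, from_, size):
    """Return (results per shard, results collected, results discarded)."""
    per_shard = from_ + size          # each shard must produce its own top from + size
    collected = shards * per_shard    # what the coordinating node has to sort
    discarded = collected - size      # everything except the requested page
    return per_shard, collected, discarded


print(deep_page_cost(5, 0, 10))       # (10, 50, 40)
print(deep_page_cost(5, 10000, 10))   # (10010, 50050, 50040)
```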
Hi,
So with max_result_window defaulting to 10K, does that mean I cannot run a query like
url?from=10000&size=200 on my indices?
Do I need to change the setting manually for that?
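For reference, a sketch of the check involved and of the settings update that would raise the limit. The limit applies to from + size, so from=10000 with size=200 asks for a window of 10200. The index name `my_index` and the new value are assumptions for illustration; the helper only builds the request and does not send it.

```python
import json

DEFAULT_MAX_RESULT_WINDOW = 10000


def exceeds_window(from_, size, window=DEFAULT_MAX_RESULT_WINDOW):
    """True when from + size is larger than the allowed result window."""
    return from_ + size > window


def raise_window_request(index, new_window):
    """Build an index settings update that raises max_result_window."""
    body = {"index": {"max_result_window": new_window}}
    return "PUT", "/%s/_settings" % index, json.dumps(body)


print(exceeds_window(10000, 200))               # True
print(raise_window_request("my_index", 20000))
```

Raising the window makes such requests legal again, but each one still carries the memory and sorting cost described earlier in the thread.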