I have a large number of documents (around 34719074) in one type of an index (ES 2.4.4). While searching, my ES cluster seems to be under high load (search latency, CPU usage, JVM memory and load average) when the "from" parameter is high (greater than 100000, the "size" parameter being constant). Any specific reason for it? My query looks like:
Because Elasticsearch needs to page through 100000 results to get the next 100. This requires resources. You can't get around that if you are paging that far.
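A rough back-of-the-envelope model shows why: each shard has to collect and sort the top `from + size` hits, and the coordinating node then merges one such list per shard. (The shard count of 5 below is an assumption for illustration.)

```python
# Rough cost model for a from/size search on a sharded index:
# every shard keeps a priority queue of (from + size) hits,
# and the coordinating node merges one such list per shard.
def deep_paging_cost(from_, size, shards):
    per_shard = from_ + size        # hits each shard must collect and sort
    merged = per_shard * shards     # entries merged on the coordinating node
    return per_shard, merged

# Page 1: cheap.
print(deep_paging_cost(0, 100, 5))       # (100, 500)
# Page 10000: each shard sorts a million hits, the coordinator merges five million.
print(deep_paging_cost(999900, 100, 5))  # (1000000, 5000000)
```

The work grows linearly with `from`, which matches the latency, CPU and JVM pressure you are seeing.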
Because our application provides pagination and we deal with lots of data. There is a Page and PageSize facility in our application. For example, if a user browses to Page=10000 with PageSize set to 100, our application issues the Elasticsearch query:
```json
"from": 999900,
"size": 100
```
We have been using Elasticsearch for 5+ years and are still using the same old features for this part (though we managed to upgrade ES from 1.x to 2.4.4). Meanwhile, we are planning an upgrade, so any suggestion for this specific part?
Rows (each stored as a document in ES) are sorted on certain fields, and since there are millions of them, if a user wants to see the middle rows they have to browse to a deep page.
The question is often: why would a user need to do that?
I often take the example of Google... I never ever click on page 1000 to see the results on that page, because I expect the search engine to give me the most important information on page 1 or 2...
That's really something you should think about. It's not about technical details but usability. If a user has to go randomly to page 1000 to see a record, then something looks wrong to me in terms of design.
I prefer adding graphical representations of aggregations, where the user can visualize the distribution of the data across different fields like date, average price, ... and then filter the result set by clicking on those graphics. Faceted navigation, that is.
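For instance, a date histogram plus an average-price aggregation gives you the data such charts can be built from (the field names `date` and `price` are assumptions here):

```json
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "per_month": {
      "date_histogram": { "field": "date", "interval": "month" }
    },
    "avg_price": {
      "avg": { "field": "price" }
    }
  }
}
```

A click on one histogram bucket then becomes a filter on that date range, so the user narrows down instead of paging through.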
Or help the user by sorting on different keys.
Or, worst case, help the user export all the data locally, where they will be able to read millions of records if they really need to. For this, you use the Scroll API.
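In 2.x a scroll walk looks roughly like this (the index name `my_index`, page size and scroll timeout are assumptions; the `scroll_id` comes back in each response, and sorting on `_doc` is the cheapest order for a full export):

```json
POST /my_index/_search?scroll=1m
{
  "size": 1000,
  "sort": ["_doc"],
  "query": { "match_all": {} }
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id returned by the previous call>"
}
```

You repeat the second call until it returns no hits.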
If you don't need random access to pages within the result set (e.g. skipping from page 1 to page 500), then consider the search_after parameter for deep paging.
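Note that search_after only ships with Elasticsearch 5.0+, so it would become an option after your planned upgrade. A sketch, where the sort field `timestamp` and the sort values are hypothetical: you pass the last hit's sort values from the previous page, with `_uid` as a tiebreaker, and "from" stays at 0.

```json
GET /my_index/_search
{
  "size": 100,
  "query": { "match_all": {} },
  "sort": [
    { "timestamp": "asc" },
    { "_uid": "asc" }
  ],
  "search_after": [1463538857, "my_type#654323"]
}
```

Because each request only asks for the next `size` hits after a known sort position, the per-shard cost stays constant no matter how deep the user goes.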