Does the “from” parameter in Elasticsearch impact the Elasticsearch cluster?

I have a large number of documents (around 34719074) in one type of an index (ES 2.4.4). While searching, my ES cluster comes under heavy load (search latency, CPU usage, JVM memory and load average all spike) when the "from" parameter is high (greater than 100000, with the "size" parameter held constant). Is there any specific reason for this? My query looks like:

{
    "explain": false,
    "size": 100,
    "from": <>,
    "_source": {
        "excludes": [],
        "includes": [
            <around 850 fields>
        ]
    },
    "sort": [
        <sorting on a string field>
    ]
}

Because Elasticsearch needs to page through the first 100000 results to get to the next 100. This requires resources, and you can't get around that if you are paging that far.
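To put rough numbers on it (the index name below is just illustrative, and the from value is taken from the pagination example later in this thread), a deep-paging request like

    GET /my_index/_search
    {
        "from": 999900,
        "size": 100,
        "sort": [
            <sorting on a string field>
        ]
    }

makes every shard build and keep a sorted priority queue of from + size = 1,000,000 entries, and the coordinating node then merges number-of-shards × 1,000,000 sort entries only to discard everything before the final 100. That merge-and-discard work is what shows up as search latency, CPU usage and JVM heap pressure.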

But why are you even paging that far into the results?

How come you are including 850 fields? Why not return the full document? How many fields do you have?


Because our application provides pagination and we deal with lots of data. The application exposes Page and PageSize parameters. For example, if a user browses to Page=10000 and PageSize is set to 100, our application issues the Elasticsearch query with:

"size":100.
"from": "999900"

We have been using Elasticsearch for 5+ years and are still using the same old approach for this part (though we did manage to upgrade ES from 1.* to 2.4.4). Meanwhile, we are planning an upgrade, so do you have any suggestions for this specific part?

Why do your users need to page through 10000 pages, though?

Rows (each stored as a document in ES) are sorted according to certain fields, and since there are millions of them, if a user wants to see rows in the middle they have to browse to a deep page.

Actually, there are around 900 fields, 50 of which are not required to be returned. Does defining the fields explicitly also hamper performance?

If you need to filter out fields, then yes, there is a cost.

It means that each source document needs to be parsed rather than just returned, which results in higher CPU usage.
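A minimal sketch of the alternative: since only about 50 of the ~900 fields need to be dropped, expressing the filter as excludes is much shorter than listing ~850 includes, though either form still makes Elasticsearch parse and re-serialize every _source it returns:

    {
        "size": 100,
        "_source": {
            "excludes": [
                <the ~50 fields that are not required>
            ]
        }
    }

Dropping the _source filter entirely and returning the whole document avoids that per-hit parsing cost altogether.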

Just a comment here:

if a user wants to see rows in the middle they have to browse to a deep page.

The question is often: why would a user need to do that?
I often take the example of Google... I never click through to page 1000 to see the results on that page, because I expect the search engine to give me the most important information on page 1 or 2...

That's really something you should think about. It's not about technical details but usability. If a user has to jump to some random page 1000 to find a record, then something looks wrong to me in terms of design.

I prefer adding graphical representations of aggregations, where the user can visualize the distribution of the data across different fields like date, average price, ... and can then filter the result set by clicking on those graphics. Faceted navigation, that is.
Or help the user by sorting on different keys.
Or, worst case, help the user export all the data locally, where they will be able to read millions of records if they really need to. For this, you use the Scroll API (a sketch follows below).

Just my thoughts here.
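For that export case, a Scroll-based pass over the whole index looks roughly like this in 2.x (the index name, scroll timeout and batch size are just examples):

    POST /my_index/_search?scroll=1m
    {
        "size": 1000,
        "sort": ["_doc"],
        "query": { "match_all": {} }
    }

    POST /_search/scroll
    {
        "scroll": "1m",
        "scroll_id": <scroll_id returned by the previous response>
    }

You keep calling _search/scroll with the most recent scroll_id until no more hits come back; sorting by _doc keeps the pass cheap because no scoring or field sorting is needed.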


If you don't need random access to pages within the result set (e.g. skipping from page 1 to page 500), then consider the search_after parameter for deep paging.
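For reference, search_after was introduced in Elasticsearch 5.0, so it would only apply after the planned upgrade. The idea is to leave from at 0 and instead pass the sort values of the last hit of the previous page (the field name and values below are made up):

    {
        "size": 100,
        "sort": [
            { "my_string_field": "asc" },
            { "_uid": "asc" }
        ],
        "search_after": ["last_value_of_my_string_field", "my_type#last_doc_id"]
    }

Because the sort values act as a cursor, each page costs roughly the same no matter how deep you are, but you can only step forward page by page, which is exactly the random-access limitation mentioned above.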

Thanks for the suggestion. So, up to what value of "from" would you suggest?

I wish that no one ever had to go to the next page...
But we have a default limit of 10000 (which means 1000 pages). That is already a high value, IMO.
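That default comes from the index.max_result_window setting (10000 by default since 2.x); requests where from + size exceeds it are rejected. It can be raised per index, though doing so only moves the deep-paging cost rather than removing it (the index name here is illustrative):

    PUT /my_index/_settings
    {
        "index.max_result_window": 20000
    }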
