Does the “from” parameter in Elasticsearch impact the Elasticsearch cluster?

I have a large number of documents (around 34719074) in one type of an index (ES 2.4.4). While searching, my ES cluster comes under heavy load (search latency, CPU usage, JVM memory and load average all spike) when the "from" parameter is high (greater than 100000, with the "size" parameter held constant). Is there any specific reason for this? My query looks like:

{
    "explain": false,
    "size": 100,
    "from": <>,
    "_source": {
        "excludes": [],
        "includes": [
            <around 850 fields>
        ]
    },
    "sort": [
        <sorting on a string field>
    ]
}

Because Elasticsearch needs to page through the first 100000 results to get to the next 100. This requires resources, and you can't get around that if you are paging that far.
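To put rough numbers on it (the index name below is just illustrative, and the from value is taken from the pagination example later in this thread), a deep-paging request like

    GET /my_index/_search
    {
        "from": 999900,
        "size": 100,
        "sort": [
            <sorting on a string field>
        ]
    }

makes every shard build and keep a sorted priority queue of from + size = 1,000,000 entries, and the coordinating node then merges number-of-shards × 1,000,000 sort entries only to discard everything before the final 100. That merge-and-discard work is what shows up as search latency, CPU usage and JVM heap pressure.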

But why are you even paging that far into the results?

How come you are including 850 fields? Why not return the full document? How many fields do you have?


Because our application provides pagination and we deal with lots of data. The application exposes Page and PageSize parameters. For example, if a user browses to Page=10000 and PageSize is set to 100, our application issues the Elasticsearch query with:

"size":100.
"from": "999900"

We have been using Elasticsearch for 5+ years and are still using the same old approach for this part (though we did manage to upgrade ES from 1.* to 2.4.4). Meanwhile, we are planning an upgrade, so do you have any suggestions for this specific part?

Why do your users need to page through 10000 pages, though?

Rows (each stored as a document in ES) are sorted according to certain fields, and since there are millions of them, if a user wants to see rows in the middle they have to browse to a deep page.

Actually, there are around 900 fields, 50 of which are not required to be returned. Does defining the fields explicitly also hamper performance?

If you need to filter out fields, then yes, there is a cost.

It means that each source document needs to be parsed rather than just returned, which results in higher CPU usage.
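A minimal sketch of the alternative: since only about 50 of the ~900 fields need to be dropped, expressing the filter as excludes is much shorter than listing ~850 includes, though either form still makes Elasticsearch parse and re-serialize every _source it returns:

    {
        "size": 100,
        "_source": {
            "excludes": [
                <the ~50 fields that are not required>
            ]
        }
    }

Dropping the _source filter entirely and returning the whole document avoids that per-hit parsing cost altogether.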

Just a comment here:

if a user wants to see rows in the middle they have to browse to a deep page.

The question is often: why would a user need to do that?
I often take the example of Google... I never click through to page 1000 to see the results on that page, because I expect the search engine to give me the most important information on page 1 or 2...

That's really something you should think about. It's not about technical details but usability. If a user has to jump to some random page 1000 to find a record, then something looks wrong to me in terms of design.

I prefer adding graphical representations of aggregations, where the user can visualize the distribution of the data across different fields like date, average price, ... and can then filter the result set by clicking on those graphics. Faceted navigation, that is.
Or help the user by sorting on different keys.
Or, worst case, help the user export all the data locally, where they will be able to read millions of records if they really need to. For this, you use the Scroll API (a sketch follows below).

Just my thoughts here.
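For that export case, a Scroll-based pass over the whole index looks roughly like this in 2.x (the index name, scroll timeout and batch size are just examples):

    POST /my_index/_search?scroll=1m
    {
        "size": 1000,
        "sort": ["_doc"],
        "query": { "match_all": {} }
    }

    POST /_search/scroll
    {
        "scroll": "1m",
        "scroll_id": <scroll_id returned by the previous response>
    }

You keep calling _search/scroll with the most recent scroll_id until no more hits come back; sorting by _doc keeps the pass cheap because no scoring or field sorting is needed.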


If you don't need random access to pages within the result set (e.g. skipping from page 1 to page 500), then consider the search_after parameter for deep paging.
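For reference, search_after was introduced in Elasticsearch 5.0, so it would only apply after the planned upgrade. The idea is to leave from at 0 and instead pass the sort values of the last hit of the previous page (the field name and values below are made up):

    {
        "size": 100,
        "sort": [
            { "my_string_field": "asc" },
            { "_uid": "asc" }
        ],
        "search_after": ["last_value_of_my_string_field", "my_type#last_doc_id"]
    }

Because the sort values act as a cursor, each page costs roughly the same no matter how deep you are, but you can only step forward page by page, which is exactly the random-access limitation mentioned above.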

Thanks for the suggestion. So, up to what value of "from" would you suggest?

I wish that no one ever had to go to the next page...
But we have a default limit of 10000 (which means 1000 pages). That is already a high value, IMO.
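That default comes from the index.max_result_window setting (10000 by default since 2.x); requests where from + size exceeds it are rejected. It can be raised per index, though doing so only moves the deep-paging cost rather than removing it (the index name here is illustrative):

    PUT /my_index/_settings
    {
        "index.max_result_window": 20000
    }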
