Paginating result set greater than 10000 (with aggregations) - Possible options


#1

My use case is pretty simple: I am running a query, with aggregations, that returns more than 10,000 documents. To navigate this result set I am using from and retrieving documents in batches of 20.

Everything runs fine, except that pagination doesn't work beyond page 500, and the following message (trimmed) appears in the logs:

Result window is too large, from + size must be less than or equal to: [10000] but was [10020].
See the scroll api for a more efficient way to request large data sets.
This limit can be set by changing the [index.max_result_window] index level setting.
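
For reference, the numbers in the message follow directly from the page size. A minimal sketch of the arithmetic, assuming 1-based page numbers (the function name is just for illustration):

```python
MAX_RESULT_WINDOW = 10_000  # default limit quoted in the error message


def page_window(page, size=20):
    """Return (from, from + size) for a 1-based page number."""
    offset = (page - 1) * size
    return offset, offset + size


print(page_window(500))  # (9980, 10000)
print(page_window(501))  # (10000, 10020)
```

Page 500 still fits because from + size is exactly 10000; page 501 is the first one that goes over.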

The message is pretty clear as to what went wrong and what needs to be done. However, neither of the stated solutions seems to be applicable to my use case:

  1. Scroll API - As far as I have read, with scroll api getting back aggregations for subsequent scrolls is not possible.

    If the request specifies aggregations, only the initial search response will contain the aggregations results.

  2. Changing the index.max_result_window looks more of a temporary hack with a good possibility of putting extra strain on resources.

Search After on the other hand looks like a viable solution, however, I am not really sure if this is the best way to go or there's a better way out to get around this issue.

Thanks for the help!


#2

Bump. Anyone?


(David Pilato) #3

Do you really have users who would like to go through 500 pages before finding what they are looking for?

Could you explain a bit more the use case?

I mean that on Google or Qwant, I rarely click on page 2 or 3. No way I have to go to page 500 to find less relevant results. Most of the time, this problem should be solved in a different way.

with scroll api getting back aggregations for subsequent scrolls is not possible.

Scroll API is meant to extract tons of data to be consumed in another tool (think about CSV export to Excel). In which case you probably don't need to get again and again the same agg result.
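
In an export-style loop that usually means reading the aggregations once, from the first response, and only collecting hits from the later ones. A rough sketch over already-fetched response dicts (no client calls are shown; `collect_export` is a hypothetical helper, and the sample responses are hand-written in the usual `hits.hits` / `aggregations` layout):

```python
def collect_export(responses):
    """Gather hits from a sequence of scroll responses.

    Aggregations are taken from the first response only, since later
    scroll pages do not carry them.
    """
    aggregations = None
    rows = []
    for index, response in enumerate(responses):
        if index == 0:
            aggregations = response.get("aggregations")
        rows.extend(hit["_source"] for hit in response["hits"]["hits"])
    return aggregations, rows


# Hand-written sample responses, for illustration only.
first = {
    "aggregations": {"by_brand": {"buckets": [{"key": "acme", "doc_count": 3}]}},
    "hits": {"hits": [{"_source": {"id": 1}}, {"_source": {"id": 2}}]},
}
later = {"hits": {"hits": [{"_source": {"id": 3}}]}}

aggs, rows = collect_export([first, later])
```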

If you have no choice, indeed search after might be a good choice. But keep in mind that your user can search after but not really before. Which means that if the user wants to go to page 502, he won't really be able to go back to 501.

So again, what is the use case? Maybe you can solve the end user problem in a different way?


#4

Do you really have users who would like to go through 500 pages before finding what they are looking for?

No, not even a single user.

Could you explain a bit more the use case?

The entire data-set is organized in different categories viz. Cat A, Cat B, Cat C, etc. And the user is allowed to browse by category, wherein a category can have more than 10,000 documents in it (each one being unique, think of products in e-commerce). The problem arises paginating in those huge categories.

Although sub-categories are indeed available for easier navigation, I am trying to cover the use case wherein a user does take the pain of navigating to the 501st page!

I mean that on Google or Qwant, I rarely click on page 2 or 3. No way I have to go to page 500 to find less relevant results. Most of the time, this problem should be solved in a different way.

Searching isn't really involved in this case; instead it's more about browsing what's already indexed. And like you said, I might be thinking in the wrong direction.

Any input will be appreciated :slight_smile:


(David Pilato) #5

I am trying to cover the use case wherein a user does take the pain of navigating to the 501st page!

I see. So no search here. You are basically searching with a match_all query and you only have a filter by category, right?

I believe you have few choices here:

  • Blocking the user after page 500, explaining that they should refine their search.
  • Increasing the default value of index.max_result_window, but as you know already this comes with a cost. Also, the deeper you go, the slower it will be.
  • Unless you use the search_after feature, where Elasticsearch can do some optimizations, but you can't really go back IMO.
  • scroll might be something to consider, but it has some drawbacks: segments are kept around until all scrolls have been released; if the user takes too much time on a page before scrolling again, he will lose the scroll id and have to restart; aggs need to be kept somewhere in your app layer; and going back is not allowed either.
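
To make the search_after option a bit more concrete, here is a rough sketch of the request bodies involved, built as plain Python dicts. The field names `category`, `created_at` and `product_id` are made up for the example; the second sort key is there because search_after needs a sort with a unique tiebreaker to page reliably. No client call is shown.

```python
def first_page_body(category, size=20):
    """Request body for the first page of a category listing."""
    return {
        "size": size,
        "query": {"bool": {"filter": [{"term": {"category": category}}]}},
        # Hypothetical fields; the last one should be unique per document.
        "sort": [{"created_at": "desc"}, {"product_id": "asc"}],
    }


def next_page_body(category, previous_response, size=20):
    """Request body for the page that follows `previous_response`.

    Returns None when the previous page had no hits (nothing left to fetch).
    """
    hits = previous_response["hits"]["hits"]
    if not hits:
        return None
    body = first_page_body(category, size)
    # The sort values of the last hit act as the cursor for the next page.
    body["search_after"] = hits[-1]["sort"]
    return body
```

Since each request only knows the cursor of the page before it, jumping to an arbitrary page or stepping backwards is not covered by this sketch.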

Redesigning the result page might be a better choice IMO, but I don't know if it's doable in your context. I like the faceted navigation pattern, which is used everywhere to help the user find the best data for him without having to go through hundreds of pages.

My 0.05 cents


#6

You are basically searching with a match_all query and you only have a filter by category, right?

Correct.

Blocking the user after page 500, explaining that they should refine their search.

This is what I have thought about doing but needed some clarity in the form of available choices.
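
A minimal sketch of that guard, assuming a page size of 20 and the default window of 10000 (`resolve_page` and the message wording are hypothetical):

```python
MAX_RESULT_WINDOW = 10_000  # default index.max_result_window
PAGE_SIZE = 20


def resolve_page(page, size=PAGE_SIZE, limit=MAX_RESULT_WINDOW):
    """Return ("ok", from_offset) or ("blocked", message) for a 1-based page."""
    last_page = limit // size  # 500 with the values above
    if page < 1:
        return "blocked", "Page numbers start at 1."
    if page > last_page:
        message = f"Only the first {last_page} pages are available; please narrow the category."
        return "blocked", message
    return "ok", (page - 1) * size
```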

Increasing the default value of index.max_result_window, but as you know already this comes with a cost. Also, the deeper you go, the slower it will be.

A couple of questions on this,

  1. If, let's say, the default value is increased to 100000 but users never really go beyond page 3 or 4. Meaning, there's enough capacity to serve that one odd request, but in general the limit isn't tested. In that case, how does this increased limit affect the query time? Is there an adverse effect on existing queries? Should I expect any slowdowns? If there is more theory on how it works, please do share the link!
  2. Can this be increased at runtime, viz. if page > 500 is requested, temporarily bump up this limit to, say, 10020 and, once the request is served, reset it to the default value?
    PUT _all/_settings?preserve_existing=true
    {
      "index.max_result_window" : "10020"
    }
    

(David Pilato) #7

  1. I believe it will be ok if users keep it reasonable. But someone can suddenly blow up your memory by sending like 10 requests at the same time (think about hitting the refresh button).
  2. Yeah. It's a dynamic setting. See https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#dynamic-index-settings

Note what is written in doc about this:

The maximum value of from + size for searches to this index. Defaults to 10000. Search requests take heap memory and time proportional to from + size and this limits that memory. See Scroll or Search After for a more efficient alternative to raising this.
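
For completeness, the bump-and-reset idea from question 2 boils down to two settings updates. A sketch of the payloads as Python dicts, assuming a hypothetical index name `products` (no client call is shown); setting the value to null restores the default. Keep in mind that the setting is index-wide, so while it is raised every concurrent search on that index is allowed the deeper window too, not just the one request.

```python
import json


def raise_window(index, new_limit):
    """(method, path, body) that raises index.max_result_window on one index."""
    return "PUT", f"/{index}/_settings", {"index.max_result_window": new_limit}


def reset_window(index):
    """(method, path, body) that puts index.max_result_window back to its default."""
    return "PUT", f"/{index}/_settings", {"index.max_result_window": None}


method, path, body = raise_window("products", 10020)
print(method, path, json.dumps(body))
method, path, body = reset_window("products")
print(method, path, json.dumps(body))  # None is serialized as JSON null
```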

