How to implement pagination with a large dataset

Environment

.Net 5
Elasticsearch.Net.Aws 7.1.0 (NuGet package)
Low level client

Problem

Even with pagination, Elasticsearch's search API does not return more than 10_000 records by default, i.e. if from + size is greater than 10_000, the API returns an error.
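To make the limit concrete, here is a minimal sketch of the check. Python is used only for illustration, and the function name and constant are hypothetical, not part of any client library. A request is rejected when from + size exceeds index.max_result_window, which defaults to 10_000.

```python
DEFAULT_MAX_RESULT_WINDOW = 10_000  # Elasticsearch default for index.max_result_window


def fits_in_result_window(page: int, page_size: int,
                          window: int = DEFAULT_MAX_RESULT_WINDOW) -> bool:
    """Return True if a 1-based page can be fetched with plain from/size paging."""
    offset = (page - 1) * page_size        # value that would be sent as "from"
    return offset + page_size <= window    # the request is rejected when this is False


print(fits_in_result_window(page=1000, page_size=10))  # last page that still fits
print(fits_in_result_window(page=1001, page_size=10))  # first page that is rejected
```

With a page size of 10, this means only the first 1_000 pages are reachable through from/size alone.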

Potential solutions

Increase size


I can increase the index's max_result_window as described here. However I am expecting a large dataset in production - probably less than 10_000_000 records at one time, but for obvious reasons I don't believe that simply increasing the window size is a good idea. My use-case does not require over-the-top performance, but it has to be reasonable for both the end-user and the AWS bill.

What do you think? How much leeway do I have with the max_result_window setting?
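For reference, raising the limit is an index settings update. The sketch below builds that request, assuming a hypothetical index named my-index and an arbitrary example value of 50_000. The body is built in Python only for illustration; the same JSON string can be sent through the low-level client.

```python
import json
from typing import Tuple


def max_result_window_request(index: str, window: int) -> Tuple[str, str, str]:
    """Build (method, path, body) for updating index.max_result_window."""
    if window < 1:
        raise ValueError("window must be positive")
    body = json.dumps({"index": {"max_result_window": window}})
    return "PUT", f"/{index}/_settings", body


# "my-index" and 50_000 are placeholders, not recommendations.
method, path, body = max_result_window_request("my-index", 50_000)
print(method, path, body)
```

Keep in mind that each shard has to collect from + size hits for every request, so memory and latency grow with page depth no matter how high the window is set.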

Track total hits


I've read about the track_total_hits parameter. It only makes each response report the exact total number of hits; it still does not allow records after the 10_000th to be fetched.
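A minimal sketch of how the parameter is typically used: request the exact total once and derive the page count from it. The helper names are hypothetical, and the body is built in Python only for illustration.

```python
import json


def first_page_body(query: dict, page_size: int) -> str:
    """Search body that asks for an exact total instead of the capped default."""
    return json.dumps({
        "track_total_hits": True,  # exact hits.total.value; does not lift the from + size limit
        "from": 0,
        "size": page_size,
        "query": query,
    })


def page_count(total_hits: int, page_size: int) -> int:
    """Number of pages needed to show total_hits documents."""
    return -(-total_hits // page_size)  # ceiling division


print(first_page_body({"match_all": {}}, 10))
print(page_count(20_000, 10))
```

This gives the UI a correct "page X of Y" label, but pages beyond the result window still need one of the other approaches.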

Scroll API


I've read about the Scroll API - the docs no longer recommend it for deep pagination, so I'd like to avoid it.

Search after


I've read about the search_after parameter - the concept is to define a consistent sort order and issue the exact same query for each page. The only difference is the value of search_after, which for every subsequent search should be the sort values of the last hit returned by the previous search.
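The loop can be sketched as follows. To keep the example runnable without a cluster, fake_search is a hypothetical in-memory stand-in for the real search call, and the timestamp and id fields are assumed names. Only the body handling in iterate_pages reflects how search_after is meant to be used.

```python
from typing import Callable, Iterator, List

# Hypothetical in-memory stand-in for an index: 25 docs, several sharing a timestamp.
DOCS = [{"id": i, "timestamp": 1_000 + i // 3} for i in range(25)]


def fake_search(body: dict) -> List[dict]:
    """Mimics a search sorted by timestamp asc, id asc, honouring search_after."""
    after = tuple(body["search_after"]) if "search_after" in body else None
    hits = []
    for doc in sorted(DOCS, key=lambda d: (d["timestamp"], d["id"])):
        sort_values = (doc["timestamp"], doc["id"])
        if after is None or sort_values > after:
            hits.append({"_source": doc, "sort": list(sort_values)})
    return hits[: body["size"]]


def iterate_pages(search: Callable[[dict], List[dict]],
                  page_size: int) -> Iterator[List[dict]]:
    """Yield pages in order; only search_after changes between requests."""
    body = {
        "size": page_size,
        "sort": [{"timestamp": "asc"}, {"id": "asc"}],  # id acts as the tiebreaker
    }
    while True:
        hits = search(body)
        if not hits:
            return
        yield hits
        body["search_after"] = hits[-1]["sort"]  # sort values of the last hit


pages = list(iterate_pages(fake_search, page_size=10))
print([len(p) for p in pages])
```

The tiebreaker field matters: if the sort values are not unique, documents sharing the last hit's value can be skipped or repeated across pages.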

As far as I can tell this is the recommended solution, but while it may work for large page sizes, I'm having difficulty understanding how it solves the basic paging case:

Let's say we have 20_000 records in total and the page size is 10, hence 2_000 pages. How can I return the last page, containing records 19_991-20_000? Unless I misunderstand, search_after does not help, because I've skipped pages and I don't have the sort value of record number 19_990.

Furthermore, per the docs:

If provided, the from argument must be 0 (default) or -1

This means that I cannot use a combination of both:

  1. Perform one search with "from": "990"
  2. Use the last record's sort value to perform a second search, again using a "from": "990"
  3. Return the results of the second search.

Beyond that I cannot figure out another way to use it. Could you tell me where I'm getting it wrong?

Hi @achobanov.

The package you are using is not an official Elastic package so we can offer no support for it. It depends on one of our libraries but I'm not sure what APIs it exposes.

For a supported client, you would need to use NEST (for v7). We have a new client in development for v8. NEST provides high-level APIs for working with Elasticsearch. In your scenario, it sounds like you're after the Scroll API functionality. Which version of Elasticsearch are you using? If you're on the latest release of 7.x, you can also leverage the point in time API together with "search after". Note, though, that these are designed for paginating through a set of results from beginning to end.
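As a rough sketch of the point in time plus "search after" combination (available from Elasticsearch 7.10; whether a given hosted cluster supports it needs to be checked): the helper names, the my-index index, and the timestamp and id sort fields are all hypothetical, and Python is used only to show the request shapes that a low-level client would send.

```python
import json
from typing import List, Optional, Tuple


def open_pit_request(index: str, keep_alive: str = "1m") -> Tuple[str, str]:
    """(method, path) that opens a point in time; the response carries an "id"."""
    return "POST", f"/{index}/_pit?keep_alive={keep_alive}"


def pit_page_body(pit_id: str, page_size: int, sort: List[dict],
                  search_after: Optional[list] = None,
                  keep_alive: str = "1m") -> str:
    """Body for POST /_search (no index in the path when a PIT is used)."""
    body = {
        "size": page_size,
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        "sort": sort,
    }
    if search_after is not None:
        body["search_after"] = search_after  # sort values of the previous page's last hit
    return json.dumps(body)


sort = [{"timestamp": "asc"}, {"id": "asc"}]
print(open_pit_request("my-index"))
print(pit_page_body("<id from the _pit response>", 10, sort))
print(pit_page_body("<id from the _pit response>", 10, sort, search_after=[1_003, 9]))
```

The point in time keeps every page on the same snapshot of the index, so documents indexed while the user is paging do not shift the results.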

Hi @stevejgordon,

I'm sorry, I've posted the AWS provider package. It's not really relevant to our discussion, however, because it only configures the connection to AWS, nothing else. I'm using the low level client instead of NEST, so I am building the search body string myself.

I've explored the options you propose and described my thoughts on them in my question. In short:

  • The Scroll API seems to be deprecated, so I'd prefer another solution.
  • I don't understand how search_after can help me, i.e. how can I use it to retrieve the last 10 documents from a dataset of 20_000? Please look at the relevant section in my question above for more info.
  • Increasing the index max_result_window seems like the easiest, but potentially problematic, solution. Can you offer any guidelines as to how much I can reasonably increase it?

My use-cases are simple queries - the most complicated one I can foresee right now is a bool query compounding up to 5 other bool queries, each consisting of up to 4 match_phrase queries. In reality I doubt it will often be used for more than a total of 5-6 queries compounded in a single bool.

Highest performance is also not sought after.

These search APIs are designed to let you paginate through a large result set from beginning to end. Point in time replaces scroll as the recommended approach if your server version supports it.

If you want the last X docs, you shouldn't need to scroll through all results. If you can sort them in descending order, you can then set size to the number of docs you want in the response.
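A minimal sketch of that idea, assuming the original sort is a list of simple {field: "asc" | "desc"} clauses. The helper names and the timestamp and id fields are hypothetical, and Python is used only to show the body that would be sent.

```python
import json
from typing import List


def flip_sort(sort: List[dict]) -> List[dict]:
    """Reverse the direction of every {field: "asc" | "desc"} clause."""
    flipped = []
    for clause in sort:
        (field, order), = clause.items()  # each clause holds exactly one field
        flipped.append({field: "desc" if order == "asc" else "asc"})
    return flipped


def last_n_body(query: dict, sort: List[dict], n: int) -> str:
    """Body that returns the last n documents of the original ordering."""
    return json.dumps({"size": n, "query": query, "sort": flip_sort(sort)})


def restore_order(hits: list) -> list:
    """Hits arrive in reversed order; flip them back for display."""
    return list(reversed(hits))


original_sort = [{"timestamp": "asc"}, {"id": "asc"}]
print(last_n_body({"match_all": {}}, original_sort, 10))
```

Because from stays at 0, this request is well within the result window regardless of how many documents match.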

Steve, I guess I can do that, but what if I have 30_000 records and I want to get records 15_000-15_010? Maybe I'm grilling you too much, because I wouldn't imagine a user manually scrolling through 150 pages. I guess that's fine for now. Thanks for your time :)
