Deep pagination vs Scroll vs Search After

Hi all,

My use-case:

Export up to 1 million documents of ~5 KB each to an Excel export, about 5 GB of output in total. This is not for real-time users and will be used by at most one user at a time. We are running Elasticsearch 2.4. The options I see:

a) Use deep pagination up to the 20 K limit and let the user keep advancing the range until all data is exported. My estimate: 20 K hits * 5 shards = 100 K documents * 5 KB each = 500 MB of memory in use each time a page is fetched. After the next batch of 20 K is requested, the memory used by the previous batch becomes eligible for garbage collection, so at any point in time roughly 500 MB of JVM heap is in use.
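For concreteness, a minimal sketch of such a from/size request with the Python client (the index name and local cluster URL are assumptions; note that Elasticsearch caps from + size via index.max_result_window, 10,000 by default, so 20 K pages need that setting raised):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

PAGE_SIZE = 20_000  # the per-page limit discussed above

def fetch_page(page: int):
    """Fetch one page with plain from/size pagination.

    Caveat: from + size is limited by index.max_result_window
    (10,000 by default), so 20 K pages require raising that setting.
    """
    return es.search(
        index="export_index",  # hypothetical index name
        body={
            "from": (page - 1) * PAGE_SIZE,
            "size": PAGE_SIZE,
            "query": {"match_all": {}},
        },
    )
```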

b) Use the scroll API instead of deep pagination and keep scrolling: how much memory will be used with each slice? What is the cost of this operation in CPU and JVM heap? How much JVM heap is in use at any point in time? Sorting/aggregation is not needed; that can be done in Excel.
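A minimal scroll-export sketch, again assuming the Python client and a hypothetical index. The server keeps a search context open per shard for the scroll's lifetime, which is where its memory cost comes from; also note that sliced scrolls were only added in Elasticsearch 5.0, so on 2.4 a scroll runs as a single sequential stream:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def export_with_scroll(index: str, batch_size: int = 1_000):
    """Stream every document of an index via the scroll API.

    The cluster keeps a snapshot (search context) open per shard for the
    scroll's lifetime; the memory cost comes from those open contexts,
    not from how deep the scroll has progressed.
    """
    resp = es.search(
        index=index,
        scroll="2m",  # keep the context alive for 2 minutes per round trip
        body={"size": batch_size, "query": {"match_all": {}}},
    )
    scroll_id = resp["_scroll_id"]
    try:
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                yield hit["_source"]
            resp = es.scroll(scroll_id=scroll_id, scroll="2m")
            scroll_id = resp["_scroll_id"]
    finally:
        es.clear_scroll(scroll_id=scroll_id)  # release the server-side context
```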

c) Use Search After instead of deep pagination: how much memory will be used with each batch? What is the cost of this operation in CPU and JVM heap? How much JVM heap is in use at any point in time?
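A search_after sketch for comparison; be aware that search_after was introduced in Elasticsearch 5.0, so it is not available on 2.4 without an upgrade. The sort fields below (timestamp, id) are hypothetical; any deterministic sort ending in a unique tiebreaker field works:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def export_with_search_after(index: str, batch_size: int = 1_000):
    """Page through all documents with search_after.

    Unlike scroll, nothing is held open on the server between requests,
    so each request costs roughly the same as a first-page query.
    Requires a deterministic sort with a unique tiebreaker to avoid
    skipping or duplicating documents on ties.
    """
    search_after = None
    while True:
        body = {
            "size": batch_size,
            "query": {"match_all": {}},
            "sort": [{"timestamp": "asc"}, {"id": "asc"}],  # hypothetical fields
        }
        if search_after is not None:
            body["search_after"] = search_after
        resp = es.search(index=index, body=body)
        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit["_source"]
        search_after = hits[-1]["sort"]  # resume from the last hit's sort values
```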

Which option(s) fit best, and what are their pros and cons?

Thanks in advance...

Your estimate of the memory usage of deep pagination is not correct. When you ask Elasticsearch for the 100th page, it still needs to retrieve all results from the previous 99 pages, because a shard cannot predict how its documents rank against documents coming from other shards. Each shard therefore returns its top from + size candidates, and the coordinating node merges and sorts all of them before discarding the first from entries.
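A small sketch of that arithmetic (shard and page numbers are illustrative):

```python
def coordinating_node_entries(shards: int, page: int, page_size: int) -> int:
    """Sort entries the coordinating node collects for one from/size page.

    Each shard must return its own top (from + size) candidates because it
    cannot know how its documents rank against those of other shards.
    """
    from_offset = (page - 1) * page_size
    return shards * (from_offset + page_size)

# Page 5 of 20 K-document pages on a 5-shard index:
# 5 * (80_000 + 20_000) = 500_000 entries merged for a single page,
# and the cost keeps growing linearly with the page number.
print(coordinating_node_entries(shards=5, page=5, page_size=20_000))
```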
