How to start a scroll deep into a result set?

andy.lowry · June 25, 2017, 2:28pm

I'm using ES as a back-end object store for an API I'm creating for a customer. ES is appropriate because the API includes some search features that benefit from ES's full text searching capabilities. But it's in most respects being used as an object store. And the client wants me to use ES for this because they're familiar with it.

Several of the API methods include pagination parameters: pageSize and pageNum.

I have two problems that don't appear to have good solutions in ES. (And by the way, my client is currently on v2.4 and wants this deployed there as well, but I haven't found much potential relief in 5.4).

First problem: Suppose a query comes in with (pageNum-1)*pageSize > 10000. (I know I can increase the index.max_result_window setting, but the docs make that sound scary, and besides, I've got millions of records, so that's probably not going to fly.)
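
To make the limit concrete, here is a minimal sketch of the offset arithmetic. It assumes the default index.max_result_window of 10000 and that ES rejects a plain search when from + size exceeds that window; the function names are hypothetical, not part of any ES client.

```python
# Default value of the index.max_result_window setting.
DEFAULT_MAX_RESULT_WINDOW = 10000


def page_offset(page_num, page_size):
    """Zero-based offset ("from") of the first hit on a 1-based page."""
    if page_num < 1 or page_size < 1:
        raise ValueError("page_num and page_size must be >= 1")
    return (page_num - 1) * page_size


def fits_in_window(page_num, page_size, max_window=DEFAULT_MAX_RESULT_WINDOW):
    """True if a plain from/size search could serve this page."""
    return page_offset(page_num, page_size) + page_size <= max_window
```

With pageSize = 100, page 100 still fits (from = 9900, size = 100), while page 101 does not and needs one of the workarounds discussed below.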

There are only two options I can think of to handle such a request:

1. Return an error response. Not cool, and probably not acceptable to my client.
2. Use the scroll API to retrieve and discard the first bunch of results, then continue to use that scroll to retrieve more records. (And re-use that scroll where possible for future requests.)

#2 would be OK (though very wasteful) if I could max out the scroll size while in my skipping phase and then put it back to a reasonable value for my actual results-returning phase. But the scroll API doesn't appear to pay attention to the size parameter on subsequent scroll requests, so my only choices appear to be a VERY slow skip phase, using a small scroll size, or having an ES scroll size that doesn't match my API page size, making my code a lot more complicated than it ought to be.
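
One way to keep the mismatch manageable is to treat the scroll as a flat stream of hits and cut API pages out of it, whatever the ES batch size is. The following is only a sketch under that assumption: fake_scroll is a stand-in for real scroll responses (it yields lists of integers instead of hits), and all names are hypothetical.

```python
from itertools import islice


def fake_scroll(total, batch_size):
    """Stand-in for a scroll: yields batches of consecutive integers."""
    for start in range(0, total, batch_size):
        yield list(range(start, min(start + batch_size, total)))


def flatten(batches):
    """Turn an iterable of batches into one lazy stream of hits."""
    for batch in batches:
        yield from batch


def take_page(hits, skip, page_size):
    """Discard `skip` hits from the stream, then return the next page."""
    return list(islice(hits, skip, skip + page_size))
```

For example, with a batch size of 1000 and an API page size of 100, `hits = flatten(fake_scroll(50000, 1000))` followed by `take_page(hits, 20000, 100)` skips straight to page 201, and a later `take_page(hits, 0, 100)` continues with page 202 from the same stream. The skipped hits are still transferred and thrown away, so this only hides the complexity, not the cost.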

Here's my second problem: Suppose I'm scrolling along just fine with a scroll size of, say, 100, matching the API pageSize parameter. And then suppose the client doesn't hit my service for a while, and by the time they ask for the next page my ES scroll ID has expired. Again, I could send an error response, but that would suck.

In this case there's a 5.4 feature that could be useful - Search After. I'd need to make sure all my requests were sorted, and that sort keys were unique across all records, and then I'd need to remember the last sort value reported in any results returned by the now-defunct scroll. That way I could specify that sort value with search_after in a new query that's otherwise identical to the first, and then continue as normal. There'd be a bit of bookkeeping on the back end, but probably not too grotesque.
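
The bookkeeping described above might look roughly like this. It is a sketch that only builds request bodies as plain dicts, with no client calls; it assumes each hit in a sorted response carries a "sort" array, and that "from" has to be dropped when search_after is used. The field names created_at and id are hypothetical.

```python
import copy


def last_sort_values(response):
    """Sort values of the last hit in a sorted search response, or None."""
    hits = response["hits"]["hits"]
    return hits[-1]["sort"] if hits else None


def resume_body(original_body, sort_values):
    """Copy of the original search body that continues after the last hit seen."""
    if "sort" not in original_body:
        raise ValueError("search_after needs an explicit sort with a unique tiebreaker")
    body = copy.deepcopy(original_body)
    body.pop("from", None)  # search_after replaces offset-based paging
    body["search_after"] = list(sort_values)
    return body


# Hypothetical query: sorted by a timestamp, with a unique id as tiebreaker.
original = {
    "size": 100,
    "query": {"match_all": {}},
    "sort": [{"created_at": "asc"}, {"id": "asc"}],
}
```

The service would store the result of last_sort_values for each page it hands out, and call resume_body with those values once the scroll has expired.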

But alas that does not exist in 2.4. And besides, it does nothing for my first problem, since I'll have no way to know what sort value to start with in that case.

So bottom line, my question is: am I missing the "right" way to do this, preferably one that will work in 2.4?

If the answer is "no, you're pretty much stuck where you think you are," then can I make a couple feature requests?

1. Support the case where from+size > result window size, as long as size does not, by itself, exceed that size. The fact that I can programmatically use the scroll API to scroll deep into my data means that the server could do it too. Just do the same retrieve/discard loop on the server that I can do on the wire, without wasting all the bandwidth and client-side processing that's currently required.
2. Honor certain benign changes to the query in the scroll API, e.g. size and _source parameters (former to allow resumption of desired scrolling after doing a "catch up" set of large scrolls, latter to avoid sending most of the data during such a "skipping" phase). (My use-case for this feature would vanish if #1 were added, but there are probably other use-cases, and it seems a perfectly natural feature. And I would use it if this were implemented but #1 were not.)

spinscale · June 26, 2017, 6:56am

Hey,

the correct way is indeed to use search_after, which is sadly not available in 2.x. You could increase the timeouts for a scroll search, but I'd highly recommend against that, as you will need to keep resources open for a longer time, potentially requiring more disk space and memory until the scrolls are closed.

--Alex

idanhagai · June 27, 2017, 9:05am

Is there a Java example or Java API documentation for searchAfter?

andy.lowry · June 27, 2017, 11:20am

Too bad, but thanks for the response. As I mentioned, this won't help in the case where I've got a cold jump into the deep, and no clue as to what sort key values to set in search_after.

This strikes me as a pretty significant deficiency in a storage technology. Any support for the feature changes I've put forward?

spinscale · June 27, 2017, 12:42pm

See SearchRequestBuilder.searchAfter(), or maybe the SearchAfterIT test in the core helps.

system · July 25, 2017, 12:42pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
