Scroll replay doesn't work due to constant scroll id

Hi,

Basically, I am using scan-and-scroll to pull some lightweight metadata from documents (fields such as Name, Hash, and RegistrationNumber, all mapped with 'store: true').
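For reference, here is roughly what I am running (the endpoint, index name, and page size are placeholders; the real code differs only in details):

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint
INDEX = "documents"            # placeholder index name

# Open the scroll, asking only for the lightweight stored fields and
# skipping _source entirely.
page = requests.post(
    f"{ES}/{INDEX}/_search?scroll=2m",
    json={
        "size": 1000,
        "_source": False,
        "stored_fields": ["Name", "Hash", "RegistrationNumber"],
        "query": {"match_all": {}},
    },
).json()

while page["hits"]["hits"]:
    scroll_id = page["_scroll_id"]   # in practice this stays the same value
    # ... hand page["hits"]["hits"] to the application here ...
    page = requests.post(
        f"{ES}/_search/scroll",
        json={"scroll": "2m", "scroll_id": scroll_id},
    ).json()

# Free the server-side scroll context when done.
requests.delete(f"{ES}/_search/scroll", json={"scroll_id": page["_scroll_id"]})
```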

Scenario: say that while fetching results the client crashes or disconnects for some valid reason, or the application crashes before it has processed the set of results it received. I couldn't find any way to re-fetch the documents that were returned by the last call but never reached the user or were never processed.

-> Given that the scroll id stays the same throughout the scroll, there is no way to fall back to the last set fetched.

-> Impact: some documents silently go missing with no way to recover them. This hurts most when nearly everything has already been fetched and only a few sets remain (say 4 million docs fetched out of 4.1 million).

-> It would be great if the user could at least re-fetch the last set that was attempted; that way there would be certainty that the last batch reached the user successfully before moving on. Before observing this behavior, I was expecting a unique scroll id on every response, which I could pass back to Elasticsearch to confirm the response was fetched successfully over the wire and that I am ready for the next set. In turn, Elasticsearch would either serve the previously requested data again, or throw an exception because the token had already been used. The toy mock below illustrates the semantics I have in mind.
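To make the ask concrete, here is a small mock of those token semantics. It is purely illustrative and not an existing Elasticsearch API:

```python
# Toy mock, purely illustrative -- this is NOT how Elasticsearch behaves
# today. Each page comes with a fresh token, the most recent token can be
# replayed, and older tokens are rejected once superseded.
class ReplayableScroll:
    def __init__(self, docs, size):
        self.pages = [docs[i:i + size] for i in range(0, len(docs), size)]
        self.valid = {0}  # tokens that may still be presented

    def fetch(self, token):
        if token not in self.valid:
            raise ValueError("token already consumed")
        self.valid = {token, token + 1}  # current page stays replayable
        page = self.pages[token] if token < len(self.pages) else []
        return page, token + 1

s = ReplayableScroll(list(range(10)), size=4)
page1, t1 = s.fetch(0)       # docs 0-3, plus the token for the next page
page2, t2 = s.fetch(t1)      # docs 4-7
replay, _ = s.fetch(t1)      # client crashed before processing: replay page 2
assert replay == page2
try:
    s.fetch(0)               # stale token: rejected, as I would expect
except ValueError as err:
    print(err)
```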

One more question: does document size affect scroll time even when the requested fields are very lightweight? I am seeing a performance difference for the same fields fetched from two mappings, one of which has some extra heavy fields (which are not being fetched).

Thanks,
Pankaj

Yep, this is a known limitation of the scroll API. There are two related open issues that are very close to what you suggested:

The root problem is that scrolls only know how to "go forward" internally, so there is no way to retry the last page. It is fundamentally a technical limitation of how things work right now, which is why any solution would need some retooling along the lines of the two tickets above.
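In the meantime, one client-side workaround is to page with `search_after` over a deterministic sort instead of a scroll. Since `search_after` is stateless, you can persist the last sort values yourself and re-issue the last request after a crash. A rough sketch, where the index name, the uniqueness of RegistrationNumber, and the checkpoint file are all assumptions:

```python
import json
import requests

ES = "http://localhost:9200"      # placeholder endpoint
INDEX = "documents"               # placeholder index name
CHECKPOINT = "checkpoint.json"    # local file holding the last sort values

def load_checkpoint():
    try:
        with open(CHECKPOINT) as f:
            return json.load(f)
    except FileNotFoundError:
        return None

def save_checkpoint(sort_values):
    with open(CHECKPOINT, "w") as f:
        json.dump(sort_values, f)

search_after = load_checkpoint()
while True:
    body = {
        "size": 1000,
        "_source": False,
        "stored_fields": ["Name", "Hash", "RegistrationNumber"],
        # Any unique, deterministic sort key works; RegistrationNumber is
        # assumed to be unique and sortable here.
        "sort": [{"RegistrationNumber": "asc"}],
        "query": {"match_all": {}},
    }
    if search_after:
        body["search_after"] = search_after
    page = requests.post(f"{ES}/{INDEX}/_search", json=body).json()
    hits = page["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        print(hit["fields"])           # stand-in for real processing
    save_checkpoint(hits[-1]["sort"])  # advance only after processing succeeds
```

The trade-off is that `search_after` does not give you the frozen point-in-time view a scroll does, so results can shift between pages if the index is changing underneath you.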

One more question, does it impact the time to scroll if the document's size is large?

I would not be surprised if this affected scroll times. The likely culprits are increased time to fetch and decompress the documents from disk, increased disk IO, slightly more time spent parsing/serializing JSON, and so on.
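If you want to confirm that on your own data, you can compare the same page fetched with and without `_source`. A quick check along these lines (same placeholders as above):

```python
import time
import requests

ES = "http://localhost:9200"   # placeholder endpoint
INDEX = "documents"            # placeholder index name

def timed(body):
    t0 = time.perf_counter()
    resp = requests.post(f"{ES}/{INDEX}/_search", json=body)
    resp.raise_for_status()
    wall = time.perf_counter() - t0
    return wall, resp.json()["took"]  # "took" is ES's own server-side ms

light = {"size": 1000, "_source": False,
         "stored_fields": ["Name", "Hash", "RegistrationNumber"]}
heavy = {"size": 1000}  # default: full _source fetched for every hit

print("stored fields only (wall s, took ms):", timed(light))
print("full _source       (wall s, took ms):", timed(heavy))
```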
