Scroll replay doesn't work due to constant scroll id

Hi,

Basically, I am using scan-and-scroll to pull some lightweight metadata from documents (fields such as Name, Hash, and RegistrationNumber, all mapped with 'store: true').
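For reference, here is roughly what I am running (the endpoint, index name, and page size are placeholders; the real code differs only in details):

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint
INDEX = "documents"            # placeholder index name

# Open the scroll, asking only for the lightweight stored fields and
# skipping _source entirely.
page = requests.post(
    f"{ES}/{INDEX}/_search?scroll=2m",
    json={
        "size": 1000,
        "_source": False,
        "stored_fields": ["Name", "Hash", "RegistrationNumber"],
        "query": {"match_all": {}},
    },
).json()

while page["hits"]["hits"]:
    scroll_id = page["_scroll_id"]   # in practice this stays the same value
    # ... hand page["hits"]["hits"] to the application here ...
    page = requests.post(
        f"{ES}/_search/scroll",
        json={"scroll": "2m", "scroll_id": scroll_id},
    ).json()

# Free the server-side scroll context when done.
requests.delete(f"{ES}/_search/scroll", json={"scroll_id": page["_scroll_id"]})
```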

Scenario: say that while fetching results the client crashes or disconnects for some valid reason, or the application crashes before it has processed the set of results it received. I couldn't find any way to re-fetch the documents that were returned by the last call but never reached the user or were never processed.

-> Given that the scroll id stays the same throughout the scroll, there is no way to fall back to the last set fetched.

-> Impact: some documents silently go missing with no way to recover them. This hurts most when nearly everything has already been fetched and only a few sets remain (say 4 million docs fetched out of 4.1 million).

-> It would be great if the user could at least re-fetch the last set that was attempted; that way there would be certainty that the last batch reached the user successfully before moving on. Before observing this behavior, I was expecting a unique scroll id on every response, which I could pass back to Elasticsearch to confirm the response was fetched successfully over the wire and that I am ready for the next set. In turn, Elasticsearch would either serve the previously requested data again, or throw an exception because the token had already been used. The toy mock below illustrates the semantics I have in mind.
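To make the ask concrete, here is a small mock of those token semantics. It is purely illustrative and not an existing Elasticsearch API:

```python
# Toy mock, purely illustrative -- this is NOT how Elasticsearch behaves
# today. Each page comes with a fresh token, the most recent token can be
# replayed, and older tokens are rejected once superseded.
class ReplayableScroll:
    def __init__(self, docs, size):
        self.pages = [docs[i:i + size] for i in range(0, len(docs), size)]
        self.valid = {0}  # tokens that may still be presented

    def fetch(self, token):
        if token not in self.valid:
            raise ValueError("token already consumed")
        self.valid = {token, token + 1}  # current page stays replayable
        page = self.pages[token] if token < len(self.pages) else []
        return page, token + 1

s = ReplayableScroll(list(range(10)), size=4)
page1, t1 = s.fetch(0)       # docs 0-3, plus the token for the next page
page2, t2 = s.fetch(t1)      # docs 4-7
replay, _ = s.fetch(t1)      # client crashed before processing: replay page 2
assert replay == page2
try:
    s.fetch(0)               # stale token: rejected, as I would expect
except ValueError as err:
    print(err)
```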

One more question: does document size affect scroll time even when the requested fields are very lightweight? I am seeing a performance difference for the same fields fetched from two mappings, one of which has some extra heavy fields (which are not being fetched).

Thanks,
Pankaj

Yep, this is a known limitation of the scroll API. There are two related open issues that are very close to what you suggested:

The root problem is that scrolls only know how to "go forward" internally, so there is no way to retry the last page. It is fundamentally a technical limitation of how things work right now, which is why any solution would need some retooling along the lines of the two tickets above.
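In the meantime, one client-side workaround is to page with `search_after` over a deterministic sort instead of a scroll. Since `search_after` is stateless, you can persist the last sort values yourself and re-issue the last request after a crash. A rough sketch, where the index name, the uniqueness of RegistrationNumber, and the checkpoint file are all assumptions:

```python
import json
import requests

ES = "http://localhost:9200"      # placeholder endpoint
INDEX = "documents"               # placeholder index name
CHECKPOINT = "checkpoint.json"    # local file holding the last sort values

def load_checkpoint():
    try:
        with open(CHECKPOINT) as f:
            return json.load(f)
    except FileNotFoundError:
        return None

def save_checkpoint(sort_values):
    with open(CHECKPOINT, "w") as f:
        json.dump(sort_values, f)

search_after = load_checkpoint()
while True:
    body = {
        "size": 1000,
        "_source": False,
        "stored_fields": ["Name", "Hash", "RegistrationNumber"],
        # Any unique, deterministic sort key works; RegistrationNumber is
        # assumed to be unique and sortable here.
        "sort": [{"RegistrationNumber": "asc"}],
        "query": {"match_all": {}},
    }
    if search_after:
        body["search_after"] = search_after
    page = requests.post(f"{ES}/{INDEX}/_search", json=body).json()
    hits = page["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        print(hit["fields"])           # stand-in for real processing
    save_checkpoint(hits[-1]["sort"])  # advance only after processing succeeds
```

The trade-off is that `search_after` does not give you the frozen point-in-time view a scroll does, so results can shift between pages if the index is changing underneath you.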

One more question, does it impact the time to scroll if the document's size is large?

I would not be surprised if this affected scroll times. The likely culprits are increased time to fetch and decompress the documents from disk, increased disk IO, slightly more time spent parsing/serializing JSON, and so on.
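If you want to confirm that on your own data, you can compare the same page fetched with and without `_source`. A quick check along these lines (same placeholders as above):

```python
import time
import requests

ES = "http://localhost:9200"   # placeholder endpoint
INDEX = "documents"            # placeholder index name

def timed(body):
    t0 = time.perf_counter()
    resp = requests.post(f"{ES}/{INDEX}/_search", json=body)
    resp.raise_for_status()
    wall = time.perf_counter() - t0
    return wall, resp.json()["took"]  # "took" is ES's own server-side ms

light = {"size": 1000, "_source": False,
         "stored_fields": ["Name", "Hash", "RegistrationNumber"]}
heavy = {"size": 1000}  # default: full _source fetched for every hit

print("stored fields only (wall s, took ms):", timed(light))
print("full _source       (wall s, took ms):", timed(heavy))
```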
