Resuming scroll request after intermediate failure

If we are scrolling through a large index (say 50M docs) and one of the requests fails after we have scrolled through 99% of the docs, it looks like we have to start all over again, which is damn expensive and feels inefficient.

I hope I am missing something. I wanted to check whether there is a way to resume a scroll request. Reusing the last scroll_id does not return the same set of documents.

Please suggest how to handle this failure scenario.

Hello,

What kind of failure are you referring to? Can you please post the error message?

Cheers

This is just a general design question so that we can handle errors appropriately. Failing at 99% and having to restart didn't make much sense to me (there hasn't been an actual failure, since we haven't used the feature yet), so I wanted to double-check with the experts.

A scroll context is stateful and is bound to the nodes where the shards being scrolled are assigned, and there are no high-availability guarantees: scroll contexts are not replicated, so if one of those nodes fails, the scroll fails with it. Another reason for a scroll to fail is the index being closed or deleted.

Besides the reasons described above (and the scroll simply timing out), I don't recall anything else that could cause a scroll context to fail. That's why I asked you to post the error.
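
For context, this is roughly what a scroll loop looks like with the Python client. This is a minimal sketch, not your exact setup: the cluster address, index name, page size, and per-hit handling are placeholders.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

# Open a scroll context; "2m" is the keep-alive, renewed on every call.
resp = es.search(
    index="my-index",  # placeholder index
    scroll="2m",
    size=1000,
    body={"query": {"match_all": {}}},
)
scroll_id = resp["_scroll_id"]

while resp["hits"]["hits"]:
    for hit in resp["hits"]["hits"]:
        print(hit["_id"])  # replace with your own handling
    # Each call returns the next page; the cursor lives on the data nodes.
    resp = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = resp["_scroll_id"]

es.clear_scroll(scroll_id=scroll_id)  # release server-side resources
```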

Sorry for the late reply, was out on vacation.

Thanks for the details about the scroll request. An error can happen because of a network issue (the root cause of the majority of live-site issues in production). The error could also be in the client library (not necessarily in the Elasticsearch cluster/node). The ability to continue via a retry mechanism would be super useful in that case.

Again, I don't have any error messages yet, but it makes us nervous to use this feature in production because we don't have a good way to handle failure (there is a workaround, albeit an expensive one) other than restarting all over again.

The scroll context stays open for a specified amount of time, so in the case of a network error you can repeat the request with the same scroll_id. But keep in mind that, depending on the type of network error, the request may actually have reached Elasticsearch and been processed; if that happened and you never received the response, repeating the request will return the *next* page of the scroll, not the one you missed.
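
As a rough sketch of that retry pattern (the `scroll_with_retry` helper is hypothetical, using the same Python client as above), retrying the same scroll_id is safe for pure connection failures, but the caveat above still applies:

```python
import time

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConnectionError as ESConnectionError

es = Elasticsearch("http://localhost:9200")  # placeholder address

def scroll_with_retry(scroll_id, keep_alive="2m", attempts=3):
    """Retry a scroll call on transient connection errors.

    Caveat from above: if the original request actually reached
    Elasticsearch and was processed, the retry returns the *next*
    page, so a page can be silently skipped.
    """
    for attempt in range(attempts):
        try:
            return es.scroll(scroll_id=scroll_id, scroll=keep_alive)
        except ESConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff
```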

Also, you mentioned using scroll requests in production. Are you planning to use this type of request at the scale of user-initiated searches?

I could retry, but there is no way to know whether the request succeeded on the server side. This means I could miss documents in the scroll.

We use scrolling to export a complete index out of ES in production. That's an option that clients control.

Indeed, this is a problem that can actually happen. I suggest you explore other options:

  1. Maybe you could use search_after. It works differently from a scroll context and has caveats of its own, but it may suit your case (see the sketch after this list).
  2. Is this an on-prem cluster that you control? If so, you can deploy Spark and es-hadoop to export the documents directly from the data nodes.
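
For option 1, here is a rough search_after sketch in Python. The index name and the unique `doc_id` sort field are assumptions, and note the caveat that without a point-in-time, concurrent writes can shift pages. The upside is that the cursor is just the last sort values, so it survives any client or network failure:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

search_after = None  # persist this to resume after any failure
while True:
    body = {
        "size": 1000,
        "query": {"match_all": {}},
        # The sort must define a total, unique order; "doc_id" is an
        # assumed unique field in the index.
        "sort": [{"doc_id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    resp = es.search(index="my-index", body=body)  # placeholder index
    hits = resp["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        print(hit["_id"])  # replace with your export logic
    search_after = hits[-1]["sort"]  # checkpoint: safe to save to disk
```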

From what I understand, search_after is expensive for exports. Would you mind elaborating a bit more on Spark and es-hadoop for exporting data?

You would create a Spark job that reads documents from Elasticsearch (this integration is provided by es-hadoop) and writes them somewhere else. You could even colocate the Spark cluster with the Elasticsearch cluster for job + data collocation, which would make the job more resilient and less error-prone.

Check https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
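
For illustration, a minimal PySpark sketch might look like this. The node address, index name, and output path are placeholders, and the es-hadoop connector jar is assumed to be on the Spark classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-export").getOrCreate()

# es-hadoop reads shards in parallel, and Spark retries failed tasks,
# which is what makes this more resilient than a single long scroll.
df = (
    spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .load("my-index")  # placeholder index
)

# Write the export somewhere durable, e.g. Parquet.
df.write.mode("overwrite").parquet("/tmp/my-index-export")
```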
