If we are scrolling through a large index (say 50M documents) and one of the requests fails after we have scrolled through 99% of the docs, it looks like we have to start all over again, which is very expensive and feels inefficient.
I hope I am missing something. I wanted to check if there is a way to resume a scroll request. Reusing the last scroll_id does not return the same set of documents.
Please suggest how to handle this failure scenario.
This is just a general design question so that we can handle error messaging appropriately. Failing at 99% and having to restart didn't make much sense to me (there hasn't been an actual failure, since we haven't used the feature yet), so I wanted to double-check with the experts.
A scroll context is stateful and is bound to the nodes where the scrolled shards are assigned, and there are no high-availability guarantees: since scroll contexts are not replicated, the scroll will fail if one of those nodes goes down. It will also fail if the index is closed or deleted.
Besides the reasons described above, I don't recall any other that could cause a scroll context to fail (other than it simply timing out). That's why I asked you to post the error.
Thanks for the details about the scroll request. An error can happen because of a network issue (the root cause of the majority of live-site issues in production). The error could also occur in the client library (not necessarily in the Elasticsearch cluster/node). Having the ability to continue via a retry mechanism would be super useful in that case.
Again, I don't have any error messages yet, but it makes us nervous to use this feature in production because we don't have a good answer for handling failure other than restarting from scratch (there is a workaround, albeit an expensive one).
The scroll context stays open for a specified amount of time, so in the case of a network error you can repeat the request with the same scroll_id. But keep in mind that, depending on the type of network error, the request may actually have reached Elasticsearch and been processed; if you did not get the response (because of the network error), the retried request will return the next page of the scroll context.
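The retry behavior described above can be sketched in a few lines. This is a hypothetical helper, not a real client API: `fetch_page` stands in for something like `es.scroll(scroll_id=..., scroll="2m")` from the elasticsearch-py client, and `TransientNetworkError` is an illustrative exception type.

```python
# Sketch: retrying a scroll request with the same scroll_id after a
# transient network error. All names here are illustrative placeholders.
import time

class TransientNetworkError(Exception):
    pass

def scroll_with_retry(fetch_page, scroll_id, max_retries=3, backoff=0.1):
    """Repeat the same scroll request until it succeeds or retries run out.

    Because the scroll context stays open for its keep-alive window, the
    same scroll_id can be resent after a network failure. Caveat from the
    thread: if the failed request actually reached Elasticsearch, the retry
    returns the *next* page, so callers should deduplicate by _id.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch_page(scroll_id)
        except TransientNetworkError:
            if attempt == max_retries:
                raise
            time.sleep(backoff * (2 ** attempt))  # exponential backoff

# Simulated transport that fails once, then succeeds, to show the flow.
calls = {"n": 0}
def fake_fetch(scroll_id):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientNetworkError("connection reset")
    return {"_scroll_id": scroll_id, "hits": {"hits": [{"_id": "doc-1"}]}}

page = scroll_with_retry(fake_fetch, "abc123")
```

Note the deduplication caveat: a retry is safe from the client's point of view, but the consumer of the pages should tolerate receiving a page it may have partially seen.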
Also, you mentioned using scroll requests in production: are you referring to using this type of request at the scale of user-initiated searches?
You could create a Spark job that reads documents from Elasticsearch (this integration is provided by es-hadoop) and writes them somewhere else. You could even colocate the Spark cluster with the Elasticsearch cluster for job/data locality, which would make the job more resilient and less error-prone.
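For reference, a minimal configuration sketch of that Spark + es-hadoop approach might look like the following. This assumes the elasticsearch-hadoop connector jar is on the Spark classpath; the host, index name, and output path are placeholders, not values from this thread.

```python
# Configuration sketch only: requires a running Spark + Elasticsearch setup
# and the es-hadoop connector jar. Host, index, and paths are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("es-export")
         .config("spark.es.nodes", "localhost")  # assumed ES host
         .config("spark.es.port", "9200")
         .getOrCreate())

# es-hadoop splits the read per shard, so a failed task is retried by
# Spark on its own, instead of the whole export restarting from zero.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .load("my-large-index"))  # placeholder index name

df.write.parquet("/data/export/my-large-index")  # placeholder output path
```

The resilience win over a raw scroll is that Spark's task-level retries replace the "restart the whole scroll" failure mode described earlier in the thread.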