Clearing a scroll during scan and scroll?

patrick99e99 · February 17, 2016, 10:44pm

I have a large document that I am trying to reindex in-place using scan and scroll.

I've created a new index and am running a search on the old index with a search type of scan, a match all query, and a size of 600, then bulk indexing each batch of _source data into the new index.

After I hit 600k documents, it fails with a timeout. The scroll time I am using is 5 minutes, and everything I have read on here says you shouldn't need a long scroll time. It also says that each scroll request should reset the timeout, which clearly is not happening, so I'm super confused.

Anyway, then I read in the docs that you can clear the scroll, so I changed my loop to clear the previous scroll on each iteration. But as soon as I try to clear the scroll after bulk indexing, I get:

Elasticsearch::Transport::Transport::Errors::NotFound: [404] {"_scroll_id":"c2NhbjswOzE7dG90YWxfaGl0czozMDAwOw==","took":1,"timed_out":false,"_shards":{"total":5,"successful":0,"failed":5,"failures":[{"status":404,"reason":"SearchContextMissingException[No search context found for id [121]]"},{"status":404,"reason":"SearchContextMissingException[No search context found for id [122]]"},{"status":404,"reason":"SearchContextMissingException[No search context found for id [123]]"},{"status":404,"reason":"SearchContextMissingException[No search context found for id [124]]"},{"status":404,"reason":"SearchContextMissingException[No search context found for id [125]]"}]},"hits":{"total":3000,"max_score":0.0,"hits":[]}}

What am I doing wrong?

msimos · February 17, 2016, 11:31pm

Hi,

You've probably seen this URL, however you may want to take a look at it again:

https://www.elastic.co/guide/en/elasticsearch/guide/1.x/scan-scroll.html

When you first create the scroll, you specify a keep-alive time, for example scroll=1m. Each subsequent request that uses the scroll id also needs to specify scroll=1m, which keeps the scroll open for an additional minute. As per the docs:

"Note that we again specify ?scroll=1m. The scroll expiry time is refreshed every time we run a scroll request, so it needs to give us only enough time to process the current batch of results, not all of the documents that match the query."

Lastly, for each subsequent scroll request you need to provide the new scroll id returned by the previous scroll request:

"The scroll request also returns a new _scroll_id. Every time we make the next scroll request, we must pass the _scroll_id returned by the previous scroll request."
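To make the two points above concrete, here is a rough Ruby sketch of such a loop. It is only an illustration under some assumptions: the method name `reindex` and its parameters are made up for this example, and `client` is assumed to respond to `search`, `scroll`, `bulk` and `clear_scroll` the way the elasticsearch-ruby client does. Note that it does not clear the scroll between batches, since clearing frees the search context that the next scroll request needs.

```ruby
# Hypothetical helper: copies every document from source_index to dest_index.
# `client` is assumed to behave like an Elasticsearch::Client instance.
def reindex(client, source_index, dest_index, scroll: '5m', size: 600)
  # Open the scroll. With search_type 'scan' the first response is expected
  # to carry no hits, only the initial _scroll_id.
  response = client.search(
    index: source_index,
    search_type: 'scan',
    scroll: scroll,
    size: size,
    body: { query: { match_all: {} } }
  )
  scroll_id = response['_scroll_id']
  copied = 0

  loop do
    # Send the scroll time with every request so the expiry is refreshed.
    response = client.scroll(scroll_id: scroll_id, scroll: scroll)
    # Always continue with the id returned by the most recent response.
    scroll_id = response['_scroll_id']
    hits = response['hits']['hits']
    break if hits.empty?

    # Build one bulk "index" action per hit from its _source.
    actions = hits.map do |hit|
      { index: { _index: dest_index, _type: hit['_type'],
                 _id: hit['_id'], data: hit['_source'] } }
    end
    client.bulk(body: actions)
    copied += hits.size
  end

  # Optional cleanup, done once after the last batch rather than inside the
  # loop. Whether this is needed for an exhausted scan is an assumption here.
  client.clear_scroll(scroll_id: scroll_id)
  copied
end
```

With a real client this would be called as something like `reindex(Elasticsearch::Client.new, 'old_index', 'new_index')`, where both index names are placeholders.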

So I would check whether your code follows this behavior when retrieving the results. Alternatively, you can use Logstash to reindex your data. Take a look at this blog:

http://david.pilato.fr/blog/2015/05/20/reindex-elasticsearch-with-logstash/
