Clearing a scroll during scan and scroll?

(Patrick Collins) #1

I have a large document that I am trying to reindex in-place using scan and scroll.

I've created a new index and am running a search against the old index with a search type of scan, a match all query, and a size of 600, then bulk indexing each hit's _source into the new index.

After I hit 600k documents, it fails with a timeout. The scroll time I am using is 5 minutes, and everything I have read on here says you shouldn't need a long scroll time. It also says that each scroll request should reset the timeout, which clearly is not happening, so I'm super confused.

Anyway, I then read in the docs that you can clear the scroll, so I changed my loop to clear the previous scroll on each iteration. But as soon as I try to clear the scroll after bulk indexing, I get:

Elasticsearch::Transport::Transport::Errors::NotFound: [404] {"_scroll_id":"c2NhbjswOzE7dG90YWxfaGl0czozMDAwOw==","took":1,"timed_out":false,"_shards":{"total":5,"successful":0,"failed":5,"failures":[{"status":404,"reason":"SearchContextMissingException[No search context found for id [121]]"},{"status":404,"reason":"SearchContextMissingException[No search context found for id [122]]"},{"status":404,"reason":"SearchContextMissingException[No search context found for id [123]]"},{"status":404,"reason":"SearchContextMissingException[No search context found for id [124]]"},{"status":404,"reason":"SearchContextMissingException[No search context found for id [125]]"}]},"hits":{"total":3000,"max_score":0.0,"hits":[]}}

What am I doing wrong?

(Mike Simos) #2


You've probably seen this URL, however you may want to take a look at it again:

Initially, when you create the scroll, you specify a keep-alive time, for example scroll=1m. Each subsequent request using the scroll ID also needs to specify scroll=1m, which keeps the scroll open for an additional minute. As per the docs:

"Note that we again specify ?scroll=1m. The scroll expiry time is refreshed every time we run a scroll request, so it needs to give us only enough time to process the current batch of results, not all of the documents that match the query."

Lastly, for each subsequent scroll request you need to provide the new scroll id returned by the previous scroll request:

"The scroll request also returns a new _scroll_id. Every time we make the next scroll request, we must pass the _scroll_id returned by the previous scroll request."
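To make the two quotes above concrete, here is a rough sketch of the loop in Python. It runs against a small in-memory stand-in rather than a real cluster: FakeScrollClient, copy_all, and the "scroll-N" IDs are all made up for illustration, and your client library's actual calls will look different. The only point is the shape of the loop: send the keep-alive on every request and always switch to the scroll ID you just got back.

```python
import itertools


class FakeScrollClient:
    """Hypothetical in-memory stand-in for a search client; not a real API."""

    def __init__(self, docs, batch_size):
        self._docs = list(docs)
        self._batch_size = batch_size
        self._ids = itertools.count(1)
        self._contexts = {}  # live scroll_id -> offset of the next batch

    def _new_id(self, offset):
        scroll_id = "scroll-%d" % next(self._ids)
        self._contexts[scroll_id] = offset
        return scroll_id

    def search(self, scroll):
        # Like a scan search: the first response carries no hits, only an ID.
        return {"_scroll_id": self._new_id(0), "hits": {"hits": []}}

    def scroll(self, scroll_id, scroll):
        # A stale ID has no context left and raises KeyError here, which
        # stands in for the "No search context found" 404 in the question.
        offset = self._contexts.pop(scroll_id)
        batch = self._docs[offset:offset + self._batch_size]
        return {"_scroll_id": self._new_id(offset + len(batch)),
                "hits": {"hits": batch}}


def copy_all(client, sink, keep_alive="1m"):
    """Drain a scroll, passing the newest scroll ID and keep-alive each time."""
    scroll_id = client.search(scroll=keep_alive)["_scroll_id"]
    copied = 0
    while True:
        page = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
        scroll_id = page["_scroll_id"]  # always switch to the ID just returned
        hits = page["hits"]["hits"]
        if not hits:
            break  # an empty batch means the scroll is exhausted
        sink.extend(hits)  # stand-in for the bulk index request
        copied += len(hits)
    return copied
```

In this toy version, reusing an old scroll ID fails immediately, which makes the mistake easy to spot. A real cluster may behave differently, so treat this as an illustration of the pattern and not of the server's exact behavior.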

So I would check whether your code follows this behavior when retrieving the results. You can also use Logstash to reindex your data. Take a look at this blog:
