Hiya
I have been testing against a staging cluster of ~1M docs and it works well.
scroll was set to '10m' and it took about 1 hr to scan and reindex 150,000
docs over the network.

Question: You mention "don't set the scroll param to eg '1d'. It stops old
segments from being cleaned up."
What are the mechanics behind this? For example, in my testing above, with
scroll=10m and the system running continuously for a few hours, will that
affect cleanups? What about reads and writes from other clients? Will
writes get blocked, and will reads see stale data for those few hours?

I'm preparing for PROD runs against live data and traffic, and the total
number of docs is about 45M. What advice would you give for tackling this
scan-and-reindex process? My plan was to run 1M at a time during low-traffic
hours between 10pm and 5am PST. This conservative approach will take
me 45 days! Or I could run it 24x7, which would take ~5 days.
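
(For illustration, here is a rough sketch of this kind of scroll-and-bulk
reindex loop, using the Python elasticsearch client; the host, index names
and batch size are placeholders, not details from the thread.)

    # Rough sketch of a scroll-based scan-and-reindex loop with the Python
    # elasticsearch client. Host, index names and sizes are illustrative.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    SOURCE_INDEX = "old_index"   # placeholder source index
    TARGET_INDEX = "new_index"   # placeholder target index

    def reindex_actions():
        # helpers.scan() drives the scroll API for us; the scroll timeout
        # ('10m' here) is renewed on every fetch, so it only needs to cover
        # the gap between two consecutive fetches, not the whole run.
        for hit in helpers.scan(es, index=SOURCE_INDEX, scroll="10m", size=500):
            yield {
                "_index": TARGET_INDEX,
                "_id": hit["_id"],
                "_source": hit["_source"],
                # older ES versions may also want "_type": hit["_type"]
            }

    # Stream the documents into the target index in bulk batches.
    helpers.bulk(es, reindex_actions(), chunk_size=500)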
As you index new documents, ES writes new "segments", where a segment is
like a fully functional inverted index all by itself. When you do a
search, ES searches through all the current segments, one by one. Every
second (by default), ES refreshes its view of the index for search; that
is, it opens readers against all the current segments.
As more segments get written, ES will merge eg 3 smaller segments
into 1 new, bigger segment. Normally these old segments are then deleted,
and ES starts searching in the new segment instead.
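
If you want to watch this happening, the indices segments API lists the
segments behind each shard. A rough sketch with the Python client (host and
index name are just examples):

    # Sketch: list the segments behind each shard, so you can watch new
    # segments appear as you index and old ones disappear after merges.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # example host

    resp = es.indices.segments(index="old_index")  # example index name
    for index_name, index_info in resp["indices"].items():
        for shard_num, copies in index_info["shards"].items():
            for copy in copies:
                print(index_name, "shard", shard_num,
                      "segments:", sorted(copy["segments"].keys()))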
When you specify that you want a scroll, ES takes a snapshot of the
current segments, and remembers them. So the results from your scroll
request always reflect the results as they were at that point in time.
Results for the scroll request won't change as you keep indexing.
However, new search requests WILL see the new segments and will return
fresh data.
When merges happen, the segments involved in a scroll request aren't
deleted. They stick around until the scroll is finished, or the scroll
timeout is reached. The timeout is renewed every time you pull another
tranche of results from the scroll request.
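
In other words, with explicit scroll calls it looks roughly like this
(Python client, values purely illustrative): each request passes
scroll='10m' again, so the timeout only ever has to outlast a single round
trip, and clearing the scroll at the end releases the old segments
immediately rather than waiting for the timeout.

    # Sketch: each scroll call re-specifies scroll='10m', which renews the
    # timeout; the snapshot only has to survive until the next fetch.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # example host

    resp = es.search(index="old_index", scroll="10m", size=500,
                     body={"query": {"match_all": {}}})

    seen = 0
    while True:
        hits = resp["hits"]["hits"]
        if not hits:
            break
        seen += len(hits)   # ... reindex this batch (eg via the bulk helper) ...
        # Pull the next tranche; scroll='10m' here renews the timeout.
        resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="10m")

    # Done: clearing the scroll lets ES release the held segments right away.
    es.clear_scroll(scroll_id=resp["_scroll_id"])
    print("scrolled through", seen, "docs")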
So the only thing to keep in mind is that you end up having many more
segments open than usual, which can use up file descriptors, memory, etc.
Make sure you have enough of both to last the whole time required
to finish your reindex.
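
If it helps, the open file descriptor count is exposed per node in the
nodes stats API, so you can keep an eye on it while the reindex runs.
Roughly (Python client, host is an example):

    # Sketch: report open file descriptors per node during a long scroll.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # example host

    stats = es.nodes.stats(metric="process")
    for node_id, node in stats["nodes"].items():
        fds = node["process"].get("open_file_descriptors")
        print(node.get("name", node_id), "open file descriptors:", fds)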
clint