Reindexing via scan type - revisited

I read this thread - http://elasticsearch-users.115913.n3.nabble.com/reindexing-via-scan-search-type-td3256635.html#a3259255

and have a few more questions.

I'm using the pyes (0.16.0) client running against an ES server (0.19.x).

The installation I am working on has 3 different indices, the biggest of which has about 50M docs. There are 3 nodes, each hosting the 3 indices plus replicas, i.e. each index is replicated to the other 2 nodes. Each index has 5 shards.

The installation is PROD, serving live read/write traffic. Because the business SLA has changed and a number of fields have been deprecated, I would like to 'remove' these fields and 'compact' the indices without changing the underlying mapping. Rather than creating a new mapping and indexing from scratch, I was hoping to do a 'rolling' reindex of 'dehydrated' docs.

Python code:

    batch_size = 5
    n = 0
    while True:
        docs = scan_index(start=n, size=batch_size)  # get next batch of docs from the index
        if not docs:
            break
        n += len(docs)
        new_docs = dehydrate(docs)  # create new docs from old docs, sans deprecated fields
        reindex(new_docs)           # reindex new docs: a delete() followed by an index()
        time.sleep(0.5)
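For reference, the dehydrate() step could be as simple as stripping keys from each doc dict. This is a minimal sketch; the DEPRECATED_FIELDS set is a hypothetical placeholder, since the actual field names weren't given:

```python
# Hypothetical set of fields deprecated by the business SLA change.
DEPRECATED_FIELDS = {'legacy_score', 'old_status'}

def dehydrate(docs):
    """Return copies of docs with the deprecated fields stripped out."""
    return [
        {k: v for k, v in doc.items() if k not in DEPRECATED_FIELDS}
        for doc in docs
    ]
```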

From the previous thread, I'm aware of potential issues with deletes, but deletes are necessary before the reindex can take place.

In testing my 'dehydrate' Python code, I first created test indices of about 50 docs with the old mapping. A run with size=10 would simulate 5 batch reads/scans and 5 batch reindexes.

Problem (which I kind of anticipated):
a) scan_index() 'misses' some docs. However, multiple runs will eventually catch most, if not all, docs.

I suspect my 'scan' logic is faulty and would appreciate comments. Please bear with me with an (abridged) copy of my Python and pyes client code:

    from pyes import ES

    def scan_es(start, size):
        connection = ES(['localhost:9200'])
        query = {'query': {'match_all': {}}, 'from': start, 'size': size}

        resultset = connection.search(query, indexes=['test_index'], scan=True)

        for r in resultset['hits']['hits']:
            yield r['_source']

    def scan_index(start, size):
        docs = []
        for d in scan_es(start, size):
            docs.append(d)
        return docs

    def reindex(docs):
        connection = ES(['localhost:9200'])
        for doc in docs:
            try:
                connection.get('test_index', 'test_doc', doc['id'], fields=['_id'])
                connection.delete('test_index', 'test_doc', doc['id'])
                connection.index(doc, 'test_index', 'test_doc', doc['id'])
            except Exception:
                pass

Questions:

  1. So each batch would invoke a new call to scan_index/scan_es. How does the scroll_id get managed? I looked at the pyes code on GitHub, and 'scan=True' seems to automagically manage the scroll_id, but I didn't dig deep enough to figure out whether it gets reset on each search() call.
  2. There is a pyes search_scroll() API, but I have no idea how to manage the scroll_id. Advice, please.
  3. The rolling reindex is expected to take a few days, which is intentional so as not to slam the PROD nodes. I can only increase batch_size to maybe 10,000 if I want to stay nice to the nodes.
  4. What would be a more elegant way of achieving this?

Questions:

  1. So each batch would invoke a new call to scan_index/scan_es. How does
    the scroll_id get managed? I looked at github pyes code and the 'scan=True'
    seem to automagically manage the scroll_id but I didn't dig deep enough to
    figure out if it gets reset on each search() call.

Looking at the pyes code, it appears that if you set 'scan', it automatically sets a scroll time as well.

However, you shouldn't repeat the search. Instead, you should extract
the scroll ID from each previous request, and pass it (plus a scroll
time) to /_search/scroll.
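That scan-then-scroll flow can be sketched as a generator. The request functions are injected here rather than written against real pyes call signatures (which are version-dependent); `search` is assumed to wrap the initial /_search request with search_type=scan, and `scroll` to wrap /_search/scroll:

```python
def scan_all(search, scroll, scroll_time='10m'):
    """Iterate over every doc in an index via scan + scroll.

    search(scroll_time)            -> (scroll_id, hits)  # initial scan request;
                                                         # in scan mode it returns no hits
    scroll(scroll_id, scroll_time) -> (scroll_id, hits)  # next tranche of results
    """
    scroll_id, _ = search(scroll_time)
    while True:
        # Each scroll response may carry a NEW scroll_id: always use the one
        # from the previous response, and re-send the scroll time to renew it.
        scroll_id, hits = scroll(scroll_id, scroll_time)
        if not hits:
            break  # scan exhausted
        for hit in hits:
            yield hit
```

The key point Clint makes is visible in the loop body: the search is issued exactly once, and every later request goes through the scroll call with the previous response's scroll_id.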

Aparo actually provides the loop to do that here:

  1. The rolling reindex is expected to take a few days which is intentional
    so as not to slam PROD nodes. I can only increase batch_size to maybe
    10,000 and continue to be nice to the nodes.

10,000 is a lot. Also consider that the number of results returned =
$size * $number_of_shards. So with 5 shards and a size of 10,000, you'd
get back 50,000 results. I find that the sweet spot is somewhere
between 500 and 5,000 (depending on doc size, RAM, etc.).
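To make that arithmetic concrete, here is a back-of-envelope sketch (the 50M total comes from the original post; size and shard count are illustrative):

```python
def docs_per_scroll_pull(size, num_shards):
    # In scan mode, each scroll pull returns up to size * num_shards hits.
    return size * num_shards

def pulls_needed(total_docs, size, num_shards):
    per_pull = docs_per_scroll_pull(size, num_shards)
    return -(-total_docs // per_pull)  # ceiling division

# e.g. 5 shards with size=1000 -> 5,000 docs per pull,
# so a 50M-doc index takes 10,000 scroll pulls to drain.
```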

clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Thanks, Clinton. I'll give that a shot.

A few more questions:
a) Is the scroll id consistent (immutable)? I'm thinking of a scenario where I could have cron jobs running every midnight for, say, 5 hours, save the last scroll id, and exit. Subsequent jobs pick up from the last scroll id. Rinse, repeat. All the while, the live index continues getting new docs as well as reindexed docs.
b) Do the 2nd and subsequent jobs call search_scroll(), bypassing search(..., scan=True)? If not, do I set start=last_scroll_id on each restart, followed by search_scroll()?

Thanks,
dave

Hi Dave

A few more questions:
a) Is the scroll id consistent (immutable)? I'm thinking of a scenario
where I can have cron jobs running every midnight for say 5 hours, save the
last scroll id and exit. Subsequent jobs pick up from the last scroll id.
Rinse, repeat. All the while the live index continues getting new docs as
well as reindexed docs.

The scroll ID can change on each scroll request, so for each subsequent
scroll request, you need to use the scroll ID from the previous request.

The scroll ID will expire 'scroll' time after the last request. Eg if
you set 'scroll' to '1m', then one minute after the previous request,
the scroll ID will expire. You need to set the 'scroll' param on each
scroll request as well.

You don't want to keep these around too long, so don't set the scroll
param to eg '1d'. It stops old segments from being cleaned up.

b) Do the 2nd and subsequent jobs calls search_scroll, bypassing the
search(...,scan=True)? If not, set the start=last_scroll_id on each restart
followed by search_scroll()?

The first request should be to /_search and subsequent requests
to /_search/scroll
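In raw REST terms, the two request shapes might look roughly like this. This is a sketch of the endpoints only; the index name, batch size, and scroll time are illustrative, and whichever client you use will wrap these details:

```python
def first_request(index, scroll_time='10m'):
    # Initial scan: hit the index's _search endpoint with
    # search_type=scan and a scroll timeout.
    return {
        'url': '/%s/_search' % index,
        'params': {'search_type': 'scan', 'scroll': scroll_time},
        'body': {'query': {'match_all': {}}, 'size': 1000},
    }

def next_request(scroll_id, scroll_time='10m'):
    # Every subsequent pull: send the PREVIOUS response's scroll_id to
    # /_search/scroll, renewing the scroll timeout each time.
    return {
        'url': '/_search/scroll',
        'params': {'scroll': scroll_time},
        'body': scroll_id,
    }
```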

clint

Thanks,
dave

--
View this message in context: http://elasticsearch-users.115913.n3.nabble.com/Reindexing-via-scan-type-revisited-tp4029731p4029849.html
Sent from the Elasticsearch Users mailing list archive at Nabble.com.


Thanks again, Clinton.

I have been testing against a staging cluster of ~1M docs and it works well. scroll was set to '10m', and it took about 1 hr to scan and reindex 150,000 docs over the network.

Question: You mention "don't set the scroll param to eg '1d'. It stops old segments from being cleaned up."
What are the mechanics behind this? For example, in my testing above, with scroll=10m and the system running continuously for a few hours, will that affect cleanups? What about reads and writes from other clients? Will writes get blocked, and will reads see stale data for those few hours? I'm preparing for PROD runs against live data and traffic, and the total number of docs is about 45M. What advice would you give for tackling this scan-and-reindex process? My plan was to run 1M at a time during low-traffic hours between 10pm and 5am PST. This conservative approach will take me 45 days! Or I could run it 24x7, which would take ~5 days.

Edit: My concern is with read and write locks, if any. CPU-wise, with some throttling, the additional load is 15% and well within tolerance.

Hiya

I have been testing against a staging cluster of ~1M docs and it works well.
scroll was set to '10m' and it took about 1 hr to scan and reindex 150,000
docs, across network traffic.

Question: You mention "don't set the scroll param to eg '1d'. It stops old
segments from being cleaned up."
What is the mechanics behind this? For example, in my above testing, with a
scroll=10m and the system running continuously for a few hours, will that
affect cleanup's? What about read's and write's from other clients? Will
write's get blocked and will read's see stale data for those few hours? I'm
preparing for PROD run's against live data and traffic and the total number
of docs is about 45M docs. What advice would you give to tackle this
scan-and-reindex process? My plan was to run 1M at a time during low
traffic hours between 10pm to 5am PST. This conservative approach will take
me 45 days! Or I could run it 24x7 which would take ~5days.

As you index new documents, ES writes new "segments", where a segment is
like a fully functional inverted index all by itself. When you do a
search, ES searches through all the current segments, one by one. Every
second, ES refreshes its view on search. That is, it opens readers
against all the current segments.

As more segments get written, ES will merge e.g. 3 smaller segments
into 1 new, bigger segment. Normally these old segments are deleted, and
ES starts searching in the new segment.

When you specify that you want a scroll, ES takes a snapshot of the
current segments, and remembers them. So the results from your scroll
request always reflect the results as they were at that point in time.
Results for the scroll request won't change as you keep indexing.
However, new search requests WILL see the new segments and will return
fresh data.

When merges happen, the segments involved in a scroll request aren't
deleted. They stick around until the scroll is finished, or the scroll
timeout is reached. The timeout is renewed every time you pull another
tranche of results from the scroll request.

So the only thing to keep in mind is that you end up having many more
segments open than usual, which can use up file descriptors, and memory
etc. Make sure you have enough of both to last the whole time required
to finish your reindex.

clint
