Reindex parent-child documents using elasticsearch.helpers parallel_bulk


(Moshe Sucaz) #1

Hi,
I wrote my own script to re-index our docs from Elasticsearch 1.5.2 to 2.3.4 using parallel_bulk from elasticsearch.helpers.
My question is: will parallel_bulk re-index child documents with “parent=”? When debugging it, I didn't see “parent” in the returned doc.
My script is:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, parallel_bulk

# ThreadPool and reindex_log are defined elsewhere in the script.

def parallel_reindex(index_name, doc_type, chunk_size=500, scroll='10m', scan_kwargs=None, bulk_kwargs=None):
    # Avoid mutable default arguments
    scan_kwargs = scan_kwargs or {}
    bulk_kwargs = bulk_kwargs or {}
    target_client = Elasticsearch(hosts=['target_host:9200'], retry_on_timeout=True, max_retries=10, timeout=1000)
    source_client = Elasticsearch(hosts=['source_host:9200'], retry_on_timeout=True, max_retries=10, timeout=1000)
    query = {"query": {"match_all": {}}}
    docs = scan(source_client,
                query=query,
                index=index_name,
                scroll=scroll,
                doc_type=doc_type,
                **scan_kwargs)

    def change_doc_params_to_elastic_2(hits):
        for h in hits:
            # change field with "." to "_"
            if 'x.y.z' in h['_source']:
                h['_source']['x_y_z'] = h['_source']['x.y.z']
                del h['_source']['x.y.z']
            # removing _analyzer
            if '_analyzer' in h['_source']:
                del h['_source']['_analyzer']
            # lift metadata returned under 'fields' to the top level of the hit
            if 'fields' in h:
                h.update(h.pop('fields'))
            yield h

    # stats_only is an option of bulk(), not parallel_bulk(), so only bulk_kwargs is forwarded
    for response in parallel_bulk(target_client, change_doc_params_to_elastic_2(docs), thread_count=8, chunk_size=chunk_size, **bulk_kwargs):
        pass
        # reindex_log.info("response: %s", response)

pool = ThreadPool(3)
for doc_type in ["a", "b", "c"]:  # b and c are children of a
    pool.add_task(parallel_reindex, index_name, doc_type)
reindex_log.info("waiting for all child docs to finish\n")
pool.wait_completion()


(Mark Harwood) #2

In 1.x the _parent field isn't included by default in responses.
You need to ask for it explicitly in your requests.
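
A minimal sketch of what this could look like, assuming the scan call is given fields=['_source', '_parent', '_routing'] (the search API in 1.x accepts a fields option) so that each hit carries its metadata under a 'fields' key. The function name hit_to_action and the sample hit are hypothetical, made up for illustration; the lifting step mirrors the 'fields' handling already present in the script above.

```python
# Sketch only: sample_hit is an assumed example of a 1.x scroll hit that was
# requested with fields=['_source', '_parent', '_routing'], e.g.
#   scan(source_client, query=query, index=index_name,
#        fields=['_source', '_parent', '_routing'])


def hit_to_action(hit):
    """Return a bulk action with the metadata under 'fields' lifted to the top level."""
    action = dict(hit)  # shallow copy so the caller's hit is left untouched
    fields = action.pop('fields', None)
    if fields:
        # '_parent' and '_routing' become top-level keys, which is where the
        # bulk helpers look for per-document metadata
        action.update(fields)
    return action


# Hypothetical child document of type "b" whose parent has id "7"
sample_hit = {
    '_index': 'my_index',
    '_type': 'b',
    '_id': '42',
    '_source': {'title': 'child doc'},
    'fields': {'_parent': '7', '_routing': '7'},
}

action = hit_to_action(sample_hit)
print(action['_parent'])
```

With '_parent' at the top level of each action, parallel_bulk should send it as document metadata rather than as part of the source, so the child is indexed under its parent on the target cluster.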


(Moshe Sucaz) #3

Thanks Mark!
I will do that...

