How to reindex last bulk of docs of elasticsearch.helpers.parallel_bulk?

(Moshe Sucaz) #1

I am upgrading our Elasticsearch cluster from 1.5.2 to 2.3.4. I wrote a Python reindex script that uses elasticsearch.helpers.parallel_bulk.
When I run the script, it reindexes 99% of the docs perfectly, but the last bulk of docs is never reindexed.
For example, if I have 2276383 docs to reindex and I run parallel_bulk with chunk_size=500, it reindexes 2276000 docs,
but the last 383 docs are not indexed and the script never ends. How can I make it reindex the last docs?
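
The numbers above suggest that the missing docs are exactly the final, partially filled chunk. A quick arithmetic check, using only the figures from this post (the helper name `split_into_chunks` is made up for illustration):

```python
def split_into_chunks(total_docs, chunk_size):
    """Return (number of full chunks, docs left over for the last partial chunk)."""
    return divmod(total_docs, chunk_size)

full_chunks, leftover = split_into_chunks(2276383, 500)

# Docs covered by full chunks, and docs that only the final partial chunk carries.
print(full_chunks * 500, leftover)  # 2276000 383
```

So every full chunk of 500 goes through, and only the last chunk of 383 is never sent.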

The function that I wrote to use parallel_bulk is:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, parallel_bulk

def parallel_reindex(index_name, doc_type, chunk_size=500, scroll='10m', scan_kwargs=None, bulk_kwargs=None):
    target_client = Elasticsearch(hosts=['node01:9200', 'node02:9200', 'node03:9200'], retry_on_timeout=True, max_retries=10, timeout=1000)
    source_client = Elasticsearch(hosts=['node04:9200', 'node05:9200', 'node06:9200'], retry_on_timeout=True, max_retries=10, timeout=1000)
    query = {"query": {"match_all": {}}}
    # scan() returns a lazy generator over all hits in the source index
    docs = scan(source_client,
                query=query,
                index=index_name,
                scroll=scroll,
                **(scan_kwargs or {}))

    def _change_doc_params_to_elastic_2(hits, target_client):
        for h in hits:
            # make some changes
            yield h

    kwargs = {
        'stats_only': True,
    }
    # parallel_bulk() is lazy too: the loop must consume it for any request to be sent
    for response in parallel_bulk(target_client, _change_doc_params_to_elastic_2(docs, target_client), thread_count=8, chunk_size=chunk_size, max_chunk_bytes=20 * 1024 * 1024):
        pass

    print("Done parallel_reindex of doc_type %s in index %s" % (doc_type, index_name))
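
To see how many docs actually made it through, it may help to tally what the helper yields instead of discarding it. A minimal sketch, assuming the results are `(ok, item)` tuples as `parallel_bulk` produces; `tally_bulk_results` and the sample data are hypothetical and only illustrate the shape:

```python
def tally_bulk_results(results):
    """Consume an iterable of (ok, item) tuples and count the outcomes."""
    succeeded = 0
    failed = []
    for ok, item in results:
        if ok:
            succeeded += 1
        else:
            failed.append(item)
    return succeeded, failed

# Hypothetical usage against a real cluster (not executed here):
#   results = parallel_bulk(target_client, actions, thread_count=8,
#                           chunk_size=500, raise_on_error=False)
#   succeeded, failed = tally_bulk_results(results)

# Stand-in data shaped like the helper's output, for illustration only.
sample = [
    (True, {"index": {"_id": "1", "status": 201}}),
    (False, {"index": {"_id": "2", "status": 400}}),
]
succeeded, failed = tally_bulk_results(sample)
print(succeeded, len(failed))
```

Comparing the success count with the number of docs in the source index shows whether a whole chunk went missing or only individual docs failed.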


(Isabel Drost-Fromm) #2

I'm not particularly familiar with the Python client, so can't actually help you with your script.

What I am wondering, though, is whether the Reindex API that's built into 2.3.x would help you.
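
For reference, a minimal sketch of the request body that a POST to `_reindex` accepts. The index names are placeholders and `build_reindex_body` is a hypothetical helper. As far as I know, in 2.3.x this copies between indices on the same cluster only.

```python
import json

def build_reindex_body(source_index, dest_index):
    """Build the JSON body for a POST to the _reindex endpoint."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }

# Placeholder index names, not taken from the thread.
body = build_reindex_body("old_index", "new_index")
print(json.dumps(body, indent=2))
```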

(Moshe Sucaz) #3

Thanks Isabel,
When I started to work on the migration, I remember reading somewhere that the Reindex API will not work for migrating from 1.x to 2.x. I also have too many changes, and I am indexing from a 1.5.2 cluster into a new 2.3.4 cluster, so I created my own script. Anyway, issue solved! It was related to elasticsearch-py 2.x, which doesn't support Elasticsearch 1.x. When I connected to the 1.x cluster with a client built for 1.x, instead of elasticsearch-py 2.x, the issue was solved.
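
A mismatch like this can be caught before a long reindex starts. A rough sketch, assuming that matching major versions is the relevant rule and that the client exposes its version as a tuple (as `elasticsearch.__version__` does); `same_major` is a hypothetical helper:

```python
def same_major(client_version, server_version):
    """Return True when the client tuple and server string share a major version."""
    server_major = int(server_version.split(".")[0])
    return client_version[0] == server_major

# Hypothetical usage against a real cluster (not executed here):
#   import elasticsearch
#   server = source_client.info()["version"]["number"]
#   same_major(elasticsearch.__version__, server)

# The versions from this thread: a 2.x client against each cluster.
print(same_major((2, 3, 0), "1.5.2"))  # False
print(same_major((2, 3, 0), "2.3.4"))  # True
```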

