Hi,
I am upgrading our elastic from 1.5.2 to 2.3.4. I wrote python re index script that is using elasticsearch.helpers.parallel_bulk.
When I am running the script, its reindexing perfectly 99% of the docs, but the last bulk of docs is never reindexed.
for example, if I have 2276383 docs to reindex, and I am running parallel_bulk with chunk_size=500, then its reindexing 2276000,
and the last 383 files are not indexed and script never ends. How can I make it reindex last docs?
The function that I wrote to use parallel_bulk is:
def parallel_reindex(index_name, doc_type, chunk_size=500, scroll='10m', scan_kwargs={}, bulk_kwargs={}):
target_client = Elasticsearch(hosts = ['node01:9200', 'node02:9200', 'node03:9200'], retry_on_timeout = True, max_retries = 10, timeout = 1000)
source_client = Elasticsearch(hosts = ['node04:9200', 'node05:9200', 'node06:9200'], retry_on_timeout=True, max_retries=10, timeout=1000)
query = {"query": {"match_all": {}}}
docs = scan(source_client,
query = query,
index = index_name,
scroll = scroll,
doc_type=doc_type,
** scan_kwargs
)
def _change_doc_params_to_elastic_2(hits, target_client):
for h in hits:
# make some changes
yield h
kwargs = {
'stats_only': True,
}
kwargs.update(bulk_kwargs)
for response in parallel_bulk(target_client, _change_doc_params_to_elastic_2(docs, target_client), thread_count=8, chunk_size=chunk_size, max_chunk_bytes=20 * 1014 * 1024):
pass
print("Done parallel_reindex of doc_type %s in index %s" % (doc_type,index_name))
Thanks