How to reindex last bulk of docs of elasticsearch.helpers.parallel_bulk?


(Moshe Sucaz) #1

Hi,
I am upgrading our Elasticsearch cluster from 1.5.2 to 2.3.4. I wrote a Python reindex script that uses elasticsearch.helpers.parallel_bulk.
When I run the script, it reindexes 99% of the docs perfectly, but the last bulk of docs is never reindexed.
For example, if I have 2276383 docs to reindex and I run parallel_bulk with chunk_size=500, it reindexes 2276000 docs,
but the last 383 docs are never indexed and the script never ends. How can I make it reindex the last docs?
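
The numbers above correspond to 4552 full chunks of 500 plus one final partial chunk of 383. A minimal sketch of that arithmetic follows; `chunked` is a hypothetical helper written for illustration, not elasticsearch-py's internal chunking code. It only shows that a chunker is expected to emit the short final chunk once the input is exhausted.

```python
from itertools import islice


def chunked(iterable, chunk_size):
    """Yield lists of up to chunk_size items; the final list may be shorter."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


total_docs = 2276383
chunk_size = 500

# Number of full chunks and the size of the leftover partial chunk
full_chunks, remainder = divmod(total_docs, chunk_size)
print(full_chunks, remainder)

# Sizes of every chunk the helper produces, including the last short one
sizes = [len(c) for c in chunked(range(total_docs), chunk_size)]
print(len(sizes), sizes[-1])
```

If the last chunk never arrives at the target, the loss happens somewhere between reading the source and flushing that final partial chunk.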

The function that I wrote to use parallel_bulk is:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, parallel_bulk

def parallel_reindex(index_name, doc_type, chunk_size=500, scroll='10m', scan_kwargs={}, bulk_kwargs={}):
    target_client = Elasticsearch(hosts=['node01:9200', 'node02:9200', 'node03:9200'], retry_on_timeout=True, max_retries=10, timeout=1000)
    source_client = Elasticsearch(hosts=['node04:9200', 'node05:9200', 'node06:9200'], retry_on_timeout=True, max_retries=10, timeout=1000)
    query = {"query": {"match_all": {}}}
    # Scroll over every document of the given type in the source index
    docs = scan(source_client,
                query=query,
                index=index_name,
                scroll=scroll,
                doc_type=doc_type,
                **scan_kwargs
                )

    def _change_doc_params_to_elastic_2(hits, target_client):
        for h in hits:
            # make some changes
            yield h

    # Note: kwargs is built here but is not passed to parallel_bulk below
    kwargs = {
        'stats_only': True,
    }
    kwargs.update(bulk_kwargs)
    # parallel_bulk returns a lazy generator, so it must be consumed for the docs to be sent
    # max_chunk_bytes is 20 MB (the original post had 1014 instead of 1024)
    for response in parallel_bulk(target_client, _change_doc_params_to_elastic_2(docs, target_client), thread_count=8, chunk_size=chunk_size, max_chunk_bytes=20 * 1024 * 1024):
        pass

    print("Done parallel_reindex of doc_type %s in index %s" % (doc_type, index_name))

Thanks


(Isabel Drost-Fromm) #2

I'm not particularly familiar with the Python client, so I can't actually help you with your script.

What I am wondering though is whether the Reindex API that's built into 2.3.x would help you:

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/docs-reindex.html
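
For reference, the Reindex API takes a JSON body naming a source and a destination index, sent as a POST to the `_reindex` endpoint. The sketch below only builds that body; `build_reindex_body` and the index names are placeholders made up for illustration. As far as I know, in 2.3 both indices have to live in the same cluster, so this would not copy data between two clusters.

```python
import json


def build_reindex_body(source_index, dest_index):
    """Build the minimal request body for POST /_reindex."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }


# Placeholder index names, not taken from the thread
body = build_reindex_body("old_index", "new_index")
print(json.dumps(body, indent=2))
```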


(Moshe Sucaz) #3

Thanks Isabel,
When I started working on the migration, I remember reading somewhere that the Reindex API does not work for migrating from 1.x to 2.x. I also have many changes to make to the docs, and I am indexing from a 1.5.2 cluster into a new 2.3.4 cluster, so I wrote my own script. Anyway, the issue is solved! It was related to elasticsearch-py 2.x not supporting Elasticsearch 1.x (see: https://github.com/elastic/elasticsearch-py/issues/438). When I connected to the 1.x cluster with pyes.es.ES instead of elasticsearch-py, the problem went away.

Thanks!
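
For anyone hitting the same thing, a quick check that the client library's major version matches the cluster's can catch this kind of mismatch before a long reindex run. This is a minimal sketch, not part of the original script: `major_version`, `client_matches_server` and `check_cluster` are hypothetical helper names, and it assumes `elasticsearch.__version__` is a tuple and that `client.info()` reports the server version under `version.number`.

```python
def major_version(version):
    """Return the major version from a string like '1.5.2' or a tuple like (2, 3, 0)."""
    if isinstance(version, str):
        return int(version.split(".")[0])
    return int(version[0])


def client_matches_server(client_version, server_version):
    """True when the client library and the cluster share a major version."""
    return major_version(client_version) == major_version(server_version)


def check_cluster(hosts):
    """Compare the installed elasticsearch-py version with a live cluster.

    Needs elasticsearch-py installed and a reachable cluster, so it is
    defined here but not called.
    """
    import elasticsearch

    client = elasticsearch.Elasticsearch(hosts=hosts)
    server_version = client.info()["version"]["number"]
    return client_matches_server(elasticsearch.__version__, server_version)


# The combination from this thread: a 2.x client against a 1.5.2 source cluster
print(client_matches_server((2, 3, 0), "1.5.2"))
# The same client against the 2.3.4 target cluster
print(client_matches_server((2, 3, 0), "2.3.4"))
```

Running `check_cluster` against the source and the target separately would show which side needs a different client.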

