I use the Hadoop plugin a lot; it's quite fast because it takes advantage of shards to work in parallel.
Does the reindex API do the same (or can it)?
Hi @ebuildy,
I am not sure whether I've misunderstood you. The reindex API basically uses sliced scrolls to retrieve documents from the source index and uses the bulk API to put the documents into the destination index. The number of shards is determined by the index settings and not really related to the reindex API.
Daniel
I am wondering whether the reindex API works like es4Hadoop does:
scrolling in parallel from each shard ====> _bulk
Let's say you have 3 data nodes and you want to reindex an index with 3 shards. That means each data node could copy 1 shard, you see?
Hi @ebuildy,
OK, got you! The reindex action is coordinated by one node in the cluster. The sliced scrolls, however, are processed in parallel.
Daniel
5.1 is where that is implemented at all. Before that, reindex was always a single process. That isn't to say it was a single thread, just that it wasn't forked. The search and bulk stages ran on as many threads as they usually do in Elasticsearch.
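For anyone wanting to try the sliced behaviour described above, here is a minimal sketch of what such a request might look like. It assumes a version that accepts the `slices` URL parameter on `_reindex` (5.1 or later, per the above). The host, the index names, and the helper name `build_reindex_request` are placeholders made up for illustration. The code only builds the HTTP request and does not send it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_reindex_request(base_url, source_index, dest_index, slices):
    """Build (but do not send) a POST _reindex request split into `slices` sub-tasks.

    base_url, source_index and dest_index are illustrative placeholders;
    `slices` is passed as a URL parameter, so it needs a version that supports it.
    """
    query = urlencode({"slices": slices})
    body = {
        "source": {"index": source_index},  # index the scroll reads from
        "dest": {"index": dest_index},      # index the bulk requests write to
    }
    return Request(
        url=f"{base_url}/_reindex?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical example: one slice per shard of a 3-shard source index.
    req = build_reindex_request("http://localhost:9200", "old_index", "new_index", 3)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would actually submit it. Matching the slice count to the number of shards in the source index, as in the 3-shard example from the question, seems like a reasonable starting point, but the right value is something to measure on your own cluster.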
Thanks for the additional detail on my answer @nik9000. I was looking at the implementation on the master branch indeed.
Daniel