Is reindex API works in parallel by shard?


(Thomas Decaux) #1

I use Hadoop plugin a lot, it's quite fast because it takes advantage of shards to work in parallel.

Is reindex API do the same? (or can)


(Daniel Mitterdorfer) #2

Hi @ebuildy,

I am not sure whether I've misunderstood you. The reindex API basically uses sliced scrolls to retrieve documents from the source index and uses the bulk API to put the documents into the destination index. The number of shards is determined by the index settings and not really related to the reindex API.

Daniel


(Thomas Decaux) #3

I am wondering if reindex API works like es4Hadoop does:

scrolling in parrallel from each shard ====> _bulk

Let's say you have 3 data nodes, you want to reindex an index with 3 shards, that mean, 1 data node could copy 1 shard, you see?


(Daniel Mitterdorfer) #4

Hi @ebuildy,

ok, got you! :slight_smile: The reindex action is coordinated by one node in the cluster. Processing of the sliced scrolls is done in parallel however.

Daniel


(Nik Everett) #5

In 5.1 where is it implemented at all. Before that reindex was always a single process. That isn't to say it is a single thread, just that it isn't forked. The search and bulk stages ran on as many threads as they usually do in Elasticsearch.


(Daniel Mitterdorfer) #6

Thanks for the additional detail on my answer @nik9000. I was looking at the implementation on the master branch indeed.

Daniel


(system) #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.