REINDEX API - Node choice when using TASK API

timost · June 15, 2017, 4:38pm

I'm running an ES 5.4.1 cluster with only two nodes. One uses HDDs (2TB) and the other uses SSDs (500GB); both are master-eligible data nodes. The HDD node holds roughly 75% of the data and the rest is on the SSD node.

I'm in the process of reindexing the data on the cluster using the REINDEX API and the TASK API with slices, as described in the documentation. All the reindexing tasks seem to run on the HDD node, which I think makes them slower than if they ran on the SSD node.

My questions are the following:

How does Elasticsearch choose the node on which to run the reindex task?
My guess would be that it runs on the node where the shard is, but I could not find anything about this in the documentation.
Is there a way to force the task to run on a given node of the cluster?

Thank you for your help!

warkolm · June 20, 2017, 7:43am

It'll reindex on whichever nodes it needs to, based on the allocation of the shards, as you thought.

If you want to force it, use allocation awareness/filtering to put the index where you want it.
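As a rough illustration of the filtering option, the sketch below builds the request for an index-level allocation filter using the `index.routing.allocation.require._name` setting. The index name `my-source-index` and node name `ssd-node` are placeholders, not values from this thread, and the helper only builds the request rather than sending it.

```python
import json


def allocation_filter_request(index, node_name):
    """Build a PUT <index>/_settings request that requires the index's
    shards to be allocated on the node called node_name."""
    body = {"index.routing.allocation.require._name": node_name}
    return "PUT", f"/{index}/_settings", json.dumps(body)


# Hypothetical names, for illustration only.
method, path, body = allocation_filter_request("my-source-index", "ssd-node")
print(method, path)
print(body)
```

The printed method, path, and body can be sent with whatever HTTP client you already use against the cluster.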

timost · June 20, 2017, 7:58am

Thank you for your answer @warkolm, I'll try this!

timost · June 23, 2017, 7:14am

In case anyone is interested, I've been able to try this procedure. Reindexing on the SSD node was roughly 5 times faster than on the HDD node (about 10,000 docs/s vs. 1,500-2,000 docs/s); see the screenshot below.

The things I had to do in order to make this happen are:

Reroute the source index shards to the SSD node using the cluster reroute API or simply the shard allocation filtering mechanism. The reroute API is useful for understanding why a shard isn't relocated.
Check that index settings do not conflict with the reroute command. In particular, check that replica shards aren't on the target node (a primary and its replica cannot be allocated to the same node).
Make sure the target indices are also on the SSD node (disk is a bottleneck for both reads and writes).
Increase cluster.routing.allocation.node_concurrent_recoveries to make rerouting shards faster.
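A minimal sketch of the two cluster-level requests mentioned in the steps above: a `move` command for the cluster reroute API (with the `explain` query parameter, which asks the cluster to report why a command can or cannot be applied) and a transient update of `cluster.routing.allocation.node_concurrent_recoveries`. The index name, node names, shard number, and the value `4` are assumptions chosen for illustration; the helpers only build the requests.

```python
import json


def move_shard_request(index, shard, from_node, to_node, explain=True):
    """Build a POST /_cluster/reroute request that moves one shard
    from from_node to to_node."""
    commands = [{
        "move": {
            "index": index,
            "shard": shard,
            "from_node": from_node,
            "to_node": to_node,
        }
    }]
    path = "/_cluster/reroute" + ("?explain" if explain else "")
    return "POST", path, json.dumps({"commands": commands})


def concurrent_recoveries_request(value):
    """Build a PUT /_cluster/settings request that raises the number of
    concurrent shard recoveries allowed per node (transient setting)."""
    body = {
        "transient": {
            "cluster.routing.allocation.node_concurrent_recoveries": value
        }
    }
    return "PUT", "/_cluster/settings", json.dumps(body)


# Hypothetical index, shard, and node names.
print(move_shard_request("my-source-index", 0, "hdd-node", "ssd-node"))
print(concurrent_recoveries_request(4))
```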

One thing that amazed me is that you can relocate the shards while the reindexing process on those shards is running.

In my particular case, I had limited storage available on the SSD node so my workflow was:

Move all the shards to be reindexed to the SSD node.
Move or create the target indices on the SSD node.
Start the reindexing process.
Delete the source index.
Move the target indices back to the HDD node if needed to save space.
Repeat for the next index.
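The workflow above could be written down as an ordered list of requests, roughly as in the sketch below. It uses allocation filtering for the moves instead of explicit reroute commands, and every index name, node name, and the slice count are placeholders. It only builds the plan; waiting for shard relocation and for the reindex task to finish (for example by polling the TASK API with the task id returned by the reindex call) is left to the caller.

```python
import json

REQUIRE_NAME = "index.routing.allocation.require._name"


def reindex_plan(source, target, fast_node, slow_node, slices=2):
    """Return the ordered (method, path, body) requests for one index."""
    return [
        # 1. Move the source shards to the fast node.
        ("PUT", f"/{source}/_settings", {REQUIRE_NAME: fast_node}),
        # 2. Create the target index on the fast node.
        ("PUT", f"/{target}", {"settings": {REQUIRE_NAME: fast_node}}),
        # 3. Start the sliced reindex as a background task.
        ("POST", f"/_reindex?slices={slices}&wait_for_completion=false",
         {"source": {"index": source}, "dest": {"index": target}}),
        # 4. Delete the source index (only after the task has completed).
        ("DELETE", f"/{source}", None),
        # 5. Move the target back to the slow node to free space.
        ("PUT", f"/{target}/_settings", {REQUIRE_NAME: slow_node}),
    ]


# Hypothetical names, for illustration only.
plan = reindex_plan("logs-old", "logs-new", "ssd-node", "hdd-node")
for method, path, body in plan:
    print(method, path, json.dumps(body) if body is not None else "")
```

Step 5 is optional, as in the list above; drop it if the SSD node has room to keep the target index.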

Even with the overhead of relocating the shards (~5 minutes per index), it was still much faster to do this than to reindex on the HDD node.

warkolm · June 23, 2017, 7:31am

Great to hear, thanks for sharing such a detailed outcome too!

