I'm running an ES 5.4.1 cluster with only two nodes. One uses HDDs (2TB), the other uses SSDs (500GB); both are master-eligible data nodes. The HDD node holds roughly 75% of the data and the remainder is on the SSD node.
I'm in the process of reindexing the data on the cluster using the Reindex API and the Task API with slices, as described in the documentation. All the reindexing tasks seem to run on the HDD node, which I think makes them slower than if they ran on the SSD node.
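For context, here is a minimal sketch of the kind of request I mean: a sliced reindex started in the background and then polled through the Task API. It only builds the requests and does not send them. The index names, the slice count and the task id are made-up placeholders, and the helper names are mine, not part of any client library.

```python
import json
from urllib.parse import urlencode


def build_reindex_request(source_index, dest_index, slices):
    """Return (method, path, body) for a sliced, asynchronous reindex.

    The path uses the `slices` and `wait_for_completion` URL parameters
    of the Reindex API; the index names are supplied by the caller.
    """
    query = urlencode({"slices": slices, "wait_for_completion": "false"})
    body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return "POST", "/_reindex?" + query, json.dumps(body)


def build_task_status_request(task_id):
    """Return (method, path, body) to poll a task returned by the call above."""
    return "GET", "/_tasks/" + task_id, None


# Hypothetical index names and task id, for illustration only.
method, path, body = build_reindex_request("logs-old", "logs-new", 5)
print(method, path)
print(body)
print(build_task_status_request("someNodeId:12345"))
```

The tuples can be sent with curl or any HTTP client against the cluster.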
My questions are the following:
How does Elasticsearch choose the node on which to run the reindex task?
My guess would be that it runs on the node where the shard is, but I could not find anything about this in the docs.
Is there a way to force the task to run on a given node of the cluster?
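To show how I am checking where the tasks run: the response of `GET _tasks?detailed=true&actions=*reindex` is grouped by node, so it can be summarised per node. A rough sketch follows; the sample response is invented for illustration (node ids, node names and task numbers are not from my cluster), and the helper name is my own.

```python
def nodes_running_reindex(tasks_response):
    """Map node name -> sorted list of reindex task ids.

    Expects the JSON body of a task list request, which nests tasks
    under nodes: {"nodes": {node_id: {"name": ..., "tasks": {...}}}}.
    """
    result = {}
    for node_id, node in tasks_response.get("nodes", {}).items():
        task_ids = [
            task_id
            for task_id, task in node.get("tasks", {}).items()
            if task.get("action", "").endswith("reindex")
        ]
        if task_ids:
            # Fall back to the node id if the response carries no name.
            result[node.get("name", node_id)] = sorted(task_ids)
    return result


# Invented sample shaped like a task list response.
sample_response = {
    "nodes": {
        "nodeA": {
            "name": "hdd-node",
            "tasks": {
                "nodeA:101": {"action": "indices:data/write/reindex"},
                "nodeA:102": {"action": "indices:data/write/bulk"},
            },
        },
        "nodeB": {
            "name": "ssd-node",
            "tasks": {
                "nodeB:7": {"action": "cluster:monitor/tasks/lists"},
            },
        },
    }
}

print(nodes_running_reindex(sample_response))
```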
In case anyone is interested, I was able to try this procedure. Reindexing on the SSD node was roughly 5 times faster than on the HDD node (about 10,000 docs/s vs. 1,500-2,000 docs/s); see the screenshot below.
The things I had to do in order to make this happen are:
Reroute the source index shards to the SSD node using the cluster reroute API or simply the shard allocation filtering mechanism. The reroute API is useful for understanding why a shard isn't being relocated.
Check that the index settings do not conflict with the reroute order. In particular, check that replica shards aren't on the target node, since a primary and its replica cannot be allocated on the same node.
Make sure the target indices are also on the SSD node (the disks are a bottleneck for both read and write operations).
Increase cluster.routing.allocation.node_concurrent_recoveries to make rerouting shards faster.
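The request bodies behind the points above look roughly like this. This is a sketch that only builds the JSON. The node names, index name, shard number and the value 4 are placeholders I picked for the example, and the function names are my own.

```python
import json


def require_node_settings(node_name):
    """Body for PUT /<index>/_settings: shard allocation filtering by node name."""
    return {"index.routing.allocation.require._name": node_name}


def move_shard_command(index, shard, from_node, to_node):
    """Body for POST /_cluster/reroute: explicitly move one shard."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


def concurrent_recoveries_settings(count):
    """Body for PUT /_cluster/settings: allow more parallel shard recoveries per node."""
    return {
        "transient": {
            "cluster.routing.allocation.node_concurrent_recoveries": count
        }
    }


# Placeholder names and values, for illustration only.
print(json.dumps(require_node_settings("ssd-node")))
print(json.dumps(move_shard_command("logs-old", 0, "hdd-node", "ssd-node")))
print(json.dumps(concurrent_recoveries_settings(4)))
```

Allocation filtering is the simpler option when a whole index has to move; the explicit move command is handy for a single shard.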
One thing that amazed me is that you can relocate the shards while the reindexing process on those shards is running.
In my particular case, I had limited storage available on the SSD node so my workflow was:
Move all the shards to be reindexed to the SSD node.
Move or create the target indices on the SSD node.
Start the reindexing process.
Delete the source index.
Move the target indices back to the HDD node if needed to save space.
Repeat for the next index.
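The workflow above can be sketched as an ordered list of requests per index. This is only an outline under my assumptions: the index and node names are placeholders, the function name is my own, and I am assuming a cluster health call with wait_for_no_relocating_shards is an acceptable way to wait for relocation to finish before moving on.

```python
def plan_index_migration(source, target, fast_node, slow_node):
    """Return the ordered (method, path, body) requests for one index."""

    def pin(node):
        # Allocation filtering: require shards to live on the named node.
        return {"index.routing.allocation.require._name": node}

    # Assumption: blocks until no shard is relocating (or the call times out).
    wait = ("GET", "/_cluster/health?wait_for_no_relocating_shards=true", None)

    return [
        ("PUT", "/%s/_settings" % source, pin(fast_node)),  # move source shards
        ("PUT", "/%s" % target, {"settings": pin(fast_node)}),  # create target there
        wait,
        ("POST", "/_reindex",
         {"source": {"index": source}, "dest": {"index": target}}),
        ("DELETE", "/%s" % source, None),  # free space on the fast node
        ("PUT", "/%s/_settings" % target, pin(slow_node)),  # optional move back
        wait,
    ]


# Placeholder names, for illustration only.
for step in plan_index_migration("logs-old", "logs-new", "ssd-node", "hdd-node"):
    print(step)
```

Deleting the source index is irreversible, so in practice I checked the document counts of the target index before that step.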
Even with the overhead of relocating the shards (~5 minutes per index), it was still much faster to do this than to reindex on the HDD node.