I'm currently reindexing 3.2bn documents from a 5-shard index into a 20-shard index on a 2-node cluster of AWS EC2 i3.4xlarge instances.
The Elasticsearch installs are configured to use both ephemeral NVMe SSDs (1.9 TB each).
The reindexing runs through a Groovy script that simply takes the old ctx._id, calculates a SHA-256 hash of the value, and sets that as the new _id.
The batch size on the reindex API call was set to 10,000.
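For illustration, the request looks roughly like this (a sketch only: old_index and new_index are placeholder names, and the exact script syntax varies by Elasticsearch version):

```
# sketch: index names are placeholders; script key/lang depend on the ES version
POST _reindex
{
  "source": {
    "index": "old_index",
    "size": 10000
  },
  "dest": {
    "index": "new_index"
  },
  "script": {
    "lang": "groovy",
    "inline": "ctx._id = java.security.MessageDigest.getInstance('SHA-256').digest(ctx._id.getBytes('UTF-8')).encodeHex().toString()"
  }
}
```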
The source and destination indices have 0 replicas, and the refresh interval on the destination index is -1.
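That is, the destination index has settings along these lines (new_index is just a placeholder name):

```
PUT new_index/_settings
{
  "index": {
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}
```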
Document routing is being used. Shard data is lumpy as hell, but that's a problem for another day.
No other activity is happening on this cluster.
What bothers me is that it's taking a long time, but netdata stats show that disk utilisation is very low, memory is well within limits (although I'm getting a saw-tooth JVM heap graph, so I may need to decrease the heap size?), and CPU doesn't peak above 25%. This is the case on both nodes.
top shows that java is indeed the top process, but it's only using 2 cores.
The nodes API with the os flag shows that Elasticsearch has correctly identified and allocated the available number of processors.
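(That check is just something like the following; the os section of the response reports available_processors and allocated_processors per node:)

```
# the os section lists available_processors and allocated_processors for each node
GET _nodes/os
```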
Is there any way I can configure Elasticsearch to totally smash through the data and stress the machines a bit more? At the current rate it's going to take months to complete.
I would consider adding more nodes, but is there any point given that the existing ones aren't being used to their maximum potential?
The hot threads API shows that the bulk threads are doing their thing, so I'm wondering about changing the thread_pool.bulk settings, which currently default to this:
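(The snippet below is roughly what the defaults amount to on a 16-vCPU box like the i3.4xlarge, expressed as the equivalent elasticsearch.yml settings; size tracks the number of available processors and the exact queue_size default depends on the version, so treat the numbers as illustrative.)

```
# illustrative defaults for a 16-vCPU node; queue_size default varies by ES version
thread_pool.bulk.size: 16
thread_pool.bulk.queue_size: 200
```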
In general, when folks have a lot of reindexing to do, I suggest that they split the process up into multiple reindex requests that'll each finish in tens of minutes, based on some natural key in their data. Then they can parallelize the process with whatever tools they like, and they can spot-check the results and all that.
Great! Yes, totally run multiple batches in parallel. If you push it too hard then the reindex will get execution rejections and will back off and retry. This isn't great for speed though. In other words: if you push it too hard it'll go slower. But you can and should manually parallelize it.
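A sketch of what that split can look like, assuming there's a date-ish field to range over (called timestamp here purely for illustration): run several of these at once, each covering a different slice of the data, and add the same id-rewriting script block to each.

```
# one of several concurrent requests, each with a different range
POST _reindex
{
  "source": {
    "index": "old_index",
    "size": 10000,
    "query": {
      "range": {
        "timestamp": {
          "gte": "2017-01-01",
          "lt": "2017-02-01"
        }
      }
    }
  },
  "dest": {
    "index": "new_index"
  }
}
```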
Each of my nodes has two disks. My index has 20 shards, so there are 10 shards on each node.
I've noticed that one of the nodes is working harder than the other one (higher I/O utilisation).
A closer look shows that one node has 5 shards per disk, while the other is a bit more unbalanced, with 3 on one disk and 7 on the other.
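(For the disk-space side of this, the nodes stats fs section reports each configured data path separately, e.g.:)

```
# each data path appears as its own entry under fs.data in the response
GET _nodes/stats/fs?human
```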
How is this decision made? Is it related to the shard size? (Some of my shards are much larger than others due to lumpy document routing.) Is there a way to force Elasticsearch to spread the shards evenly across the disks, and is that even a good idea?
From the data it looks like the disk with seemingly fewer shards from my index holds more data for other miscellaneous/internal indices (created by Kibana, Monitoring, etc.).
That sort of makes sense. It's useful to know. In future, for example, I'd hold off on installing Kibana until my indices have spread their data nicely across all the disks.