I'm reindexing a 9.2 TB index (~2bn documents) from a v2-created index (restored onto a v5.6 cluster) into a v5-created 20-shard index (on the same cluster). The Elasticsearch cluster consists of nine nodes, each with 32 GiB RAM, 8 cores and a 4 TB SSD.
It's taken about 12 days so far and seems to have slowed right down. My netdata dashboard shows that CPU is not being taxed at all, disk utilisation is up (as expected) and RAM is in high use.
Reindex batch size was 10,000, and everything runs through a Groovy script that converts the v5-incompatible document IDs to SHA-256 hashes of themselves. The indexing rate on the destination index is being reported as ~300-400/sec, and the query rate on the source index is <50/sec. index.refresh_interval is set to -1 (although I only did that today after some more rooting around).
I'm a little bit worried. I've got another reindex process to run on a 5-shard index that has ~3.2bn documents in it, although it's only 2.3TB data.
This is all taking much longer than expected.
My question: if I roll another node into the cluster, will it adversely affect the reindex process? I'm assuming that as soon as another node becomes available, ES will start rebalancing the shards. If that conflicts with the reindex process I'll be extremely distraught!
I moved this one to its own topic. Your request is about reindex speed, but your setup is different, so it makes sense to ask questions specifically about it.
In general I recommend folks break up big reindex tasks into many smaller ones and manage them manually or with a simple bash script. Smaller tasks can be stopped and restarted and you get a real progress report as the small ones finish. If you have a date field in your source index it is usually fairly easy to reindex a day's worth of docs at a time. Or an hour. Or a month. It depends on the number of documents you have and how small you'd like your batches to be.
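As a rough illustration of the day-at-a-time idea, here is a minimal Python sketch that only builds the request bodies for the _reindex API, one per day, using a range query on a date field. The index names, the field name "timestamp" and the batch size are all placeholders I made up, and it assumes the field accepts plain yyyy-MM-dd dates. Sending each body (for example with curl or a client library) and waiting for it to finish before sending the next is left to whatever tooling you prefer.

```python
from datetime import date, timedelta


def daily_reindex_bodies(source, dest, field, start, end, batch_size=1000):
    """Yield (label, body) pairs: one _reindex request body per day in [start, end)."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        body = {
            "source": {
                "index": source,
                "size": batch_size,  # scroll batch size for this task
                # Half-open range so no document falls into two tasks.
                "query": {
                    "range": {
                        field: {
                            "gte": day.isoformat(),
                            "lt": next_day.isoformat(),
                        }
                    }
                },
            },
            "dest": {"index": dest},
        }
        yield day.isoformat(), body
        day = next_day


if __name__ == "__main__":
    # Hypothetical names; replace with your own indices and date field.
    for label, body in daily_reindex_bodies(
        "old-index", "new-index", "timestamp", date(2017, 1, 30), date(2017, 2, 2)
    ):
        print(label, body)
```

Because each task covers a fixed, non-overlapping window, a failed day can be rerun on its own, and the list of finished days doubles as the progress report.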
I changed the index settings so that there were 0 replicas. This made Kibana show a negative indexing rate for a while, but it brought the index memory and the number of segments right down.
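For anyone following along, the two settings changes mentioned in this thread (0 replicas and index.refresh_interval of -1) could be expressed as a body for the index _settings endpoint roughly like this. It is only a sketch that builds the JSON. The values in restore_settings are my assumption of what you would put back afterwards (they are the Elasticsearch defaults, not necessarily what the index had before).

```python
import json


def bulk_load_settings():
    """Settings body for the duration of the reindex: no replicas, no refreshes."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


def restore_settings(replicas=1, refresh_interval="1s"):
    """Settings body to apply once the reindex is done (values are assumptions)."""
    return {
        "index": {
            "number_of_replicas": replicas,
            "refresh_interval": refresh_interval,
        }
    }


if __name__ == "__main__":
    # Each body would be sent with PUT to <destination index>/_settings.
    print(json.dumps(bulk_load_settings()))
    print(json.dumps(restore_settings()))
```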