Today I launched a reindex task for a big index (4.1TB) in order to provide it with a more appropriate number of shards (10 -> 90).
At the beginning it was really good (30K/s with spikes of 45K), but after 9 hours it is around 7K/s, which is quite disappointing, as at this pace it will never finish.
What factors could be involved in this dramatic decrease in indexing performance? I see the segment count has stayed quite stable for most of the time.
I believe the reindex API keeps the document ID when indexing, which means each indexing operation is in reality an update, as it needs to check whether the document already exists. An update is much slower than an indexing operation where Elasticsearch is allowed to set the ID automatically, as that can never result in an ID collision. If you have slow storage and large indices, the slowdown can be significant over time and is likely to continue deteriorating.
If you do not need to keep the document IDs, one way to speed this up might be to reindex in a way that drops the _id field, for example via an ingest pipeline. I believe this should be possible but have not tried it. Be aware that this could potentially lead to duplicates in case the reindex process is forced to retry.
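As an untested sketch (with hypothetical index names), something along these lines might work. Here a reindex script is used instead of an ingest pipeline, since the reindex script context is documented as being allowed to modify the _id metadata field; clearing it should make Elasticsearch auto-generate IDs in the destination index:

```
# Sketch only: reindex while clearing _id so the destination auto-generates IDs
POST _reindex
{
  "source": {
    "index": "old-index"
  },
  "dest": {
    "index": "new-index"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._id = null"
  }
}
```

If you go this route, test it on a small index first to confirm the destination really ends up with auto-generated IDs rather than nulls.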
Depending on which version you are using, you may want to make sure you set up the index so you can use the split index API if needed in the future.
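For reference, a rough sketch of what that setup could look like (hypothetical index names; on 6.x the index.number_of_routing_shards setting has to be chosen at index creation time, while newer versions pick a default that already allows splitting by factors of 2 and 3):

```
# Create the destination index so it can later be split from 10 to 90 shards
PUT new-index
{
  "settings": {
    "index.number_of_shards": 10,
    "index.number_of_routing_shards": 90
  }
}

# Later: block writes, then split (the target shard count must be a multiple
# of the source shard count and compatible with number_of_routing_shards)
PUT new-index/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

POST new-index/_split/new-index-split
{
  "settings": {
    "index.number_of_shards": 90
  }
}
```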