As others have reported here, we too were surprised at how slow the Reindex API is if you simply follow the documentation and issue a straight reindex.
For reference, we have a 9M-doc / 4 TB index with a simple, clean schema but large documents, on a well-provisioned 10-data-node [+masters/clients] 2.3.4 cluster under minimal load... and it is taking several days to reindex, peaking at best around 100 docs/s and averaging more like 25. This is with an identical schema and no source filtering or rewriting: just a simple pour-over from one shard count to a higher one.
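For context, the "straight reindex" in question was essentially the stock request from the docs (index names here are placeholders, not our real ones):

```
POST _reindex
{
  "source": { "index": "docs_v1" },
  "dest":   { "index": "docs_v2" }
}
```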
Reading this thread, I found the suggestion to parallelize manually by effectively 'sharding' the index according to some feature of the source, presumably by filtering against a unique key.
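A minimal sketch of that manual sharding, assuming the unique key is the document id (the function names here are my own, not anything from the API): hash each key into one of N disjoint slices, then run one scroll+bulk worker per slice.

```python
import hashlib

def slice_for(doc_id: str, num_slices: int) -> int:
    """Map a unique key to one of num_slices partitions (stable across runs)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_slices

def partition(doc_ids, num_slices):
    """Group document ids into disjoint slices; each slice becomes one
    worker's independent scroll/bulk reindex job."""
    slices = [[] for _ in range(num_slices)]
    for doc_id in doc_ids:
        slices[slice_for(doc_id, num_slices)].append(doc_id)
    return slices
```

Each worker then issues its own scroll filtered to its slice's ids (or a range of the key, if the key is ordered) and bulk-indexes into the destination, so the slices never overlap and together cover everything.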
I'd like to request that, in its stable form (5.x+), the Reindex API at minimum perform concurrent scroll/bulk posts per shard automatically. Presumably the same logic used to route documents to shards by id could also be used to partition the reads into parallel processes.
It would be invaluable to automatically make each shard reindex in parallel.
(The ideal, of course, would be a simple option on the initiating request, say "concurrency": 10... It's funny that there are dials for throttling, but the common experience seems to be that an over-exuberant reindexing process is the least of anyone's worries!)
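For what it's worth, the throttling dial in question is the `requests_per_second` URL parameter on the reindex request (the index names below are placeholders); there is no equivalent dial in the other direction:

```
POST _reindex?requests_per_second=500
{
  "source": { "index": "docs_v1" },
  "dest":   { "index": "docs_v2" }
}
```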
This won't help you with the built-in API, but we built a tool called Teraslice because we needed much higher throughput for these kinds of operations, along with the ability to work across clusters. We recently used it to migrate 71B records from a single cluster into 8 clusters in less than 14 hours, something like 1.2M docs/s.
Speed notwithstanding, the API has been a godsend. We had the opportunity to put the 'create' operation filter to the test when the node driving the reindex was rebooted before the process finished, and the fact that resuming required only adding a parameter to the original POST was lovely.
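For anyone else hitting the same interruption: the resume amounts to re-running the original request with `op_type` set to `create` (so documents that already exist in the destination are skipped) and telling reindex to proceed past the resulting version conflicts rather than abort. Index names here are placeholders:

```
POST _reindex
{
  "conflicts": "proceed",
  "source": { "index": "docs_v1" },
  "dest":   { "index": "docs_v2", "op_type": "create" }
}
```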
We're also going to experiment with application-level parallelization this week; we have a motivating test case that requires some additional fields in our index which have to be fetched per document.
Fetching things per document sounds like something that should be handled by an external script rather than reindex. It'll probably work, though...
This thread did remind me that I needed to work on parallelizing reindex. I reached out to the author of parallel scroll and we talked through a plan. I'll put together a proposal soon-ish, though it isn't going to make it for 5.0; 5.1 if all goes well.
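For context, the sliced-scroll mechanism that plan builds on lets each of N workers open a scroll over a disjoint partition of the same index, e.g. worker 0 of 10 (index name is a placeholder):

```
GET docs_v1/_search?scroll=1m
{
  "slice": { "id": 0, "max": 10 },
  "size": 1000,
  "query": { "match_all": {} }
}
```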
Indeed, for this run we're going to do an application-side reindex! I did briefly consider whether there was value in first indexing the fields we need to add into a temporary secondary index, then doing document merges (with all the data actually fetchable as existing ES documents in the same cluster), but that's definitely overkill for our immediate case.
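A rough sketch of the shape we have in mind for the application-side run: N workers, each walking its own slice of the source, enriching every document with the extra per-document fields, and bulk-writing to the destination in batches. Plain dicts stand in for the two indices here, and `fetch_extra` is a placeholder for the real per-document lookup, so this is a structural sketch only:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_extra(doc_id):
    # Placeholder for the real per-document fetch.
    return {"extra": f"enriched-{doc_id}"}

def reindex_slice(source, dest, doc_ids, batch_size=100):
    """One worker: walk a slice of ids in batches, enrich, and bulk-write."""
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]
        actions = {}
        for doc_id in batch:
            doc = dict(source[doc_id])       # read from the old index
            doc.update(fetch_extra(doc_id))  # add the per-document fields
            actions[doc_id] = doc
        dest.update(actions)                 # one bulk write per batch

def parallel_reindex(source, dest, num_workers=4):
    ids = sorted(source)
    slices = [ids[i::num_workers] for i in range(num_workers)]  # round-robin slices
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for s in slices:
            pool.submit(reindex_slice, source, dest, s)
    return dest
```

The round-robin slicing keeps the workers' id sets disjoint, so there is no write contention between slices; swapping the dicts for real scroll and bulk calls is the only structural change needed.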