Why doesn't the Reindex API parallelize by shard automatically?

aaron_ximm · September 14, 2016, 5:14pm

As has been reported elsewhere here, we too were very surprised at how slow the Reindex API is if you simply follow the documentation and issue a straight reindex.

For reference, we have a 9M doc/4 TB index, simple clean schema but large documents, on a well-provisioned 10-data-node [+masters/clients] 2.3.4 cluster under minimal load... and it is taking several days to reindex, peaking at 100 docs/s at best and averaging more like 25. This is with an identical schema and no source filtering, rewriting, etc... just a simple pour-over from one shard count to a higher one.

Reading here I found the suggestion to perform manual parallelization by effectively 'sharding' the index according to some feature of the source. Presumably filtering against a unique key.
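A minimal sketch of that manual approach, assuming the documents carry a numeric key whose minimum and maximum are known. The field name `doc_id` and the index names are hypothetical. It only builds one reindex request body per key range; each body would then be POSTed to `_reindex` concurrently.

```python
def partition_reindex_bodies(source_index, dest_index, key_field,
                             min_key, max_key, parts):
    """Split [min_key, max_key] into `parts` contiguous ranges and
    return one reindex request body per range."""
    span = max_key - min_key + 1
    bodies = []
    for i in range(parts):
        lo = min_key + (span * i) // parts        # inclusive lower bound
        hi = min_key + (span * (i + 1)) // parts  # exclusive upper bound
        bodies.append({
            "source": {
                "index": source_index,
                # Each request only reads the documents in its own key range.
                "query": {"range": {key_field: {"gte": lo, "lt": hi}}},
            },
            "dest": {"index": dest_index},
        })
    return bodies


# Hypothetical names: four requests covering keys 0..99.
bodies = partition_reindex_bodies("old_index", "new_index", "doc_id", 0, 99, 4)
```

Contiguous ranges only balance the work if the keys are roughly evenly distributed; a skewed key would need different boundaries.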

I'd like to request that in its stable form (5.x+) the Reindex API at minimum automatically perform concurrent scroll/bulk posts by shard. Presumably the same logic used to shard documents by id could also be used to partition into parallel read processes.

It would be invaluable to automatically make each shard reindex in parallel.
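To illustrate the idea of partitioning by id, here is a rough sketch that assigns each document id to one of N slices by hashing it. `zlib.crc32` is only a stand-in, not the hash Elasticsearch uses for routing, and the function names are hypothetical.

```python
import zlib


def slice_for(doc_id, slices):
    """Map a document id to a slice number in [0, slices)."""
    # crc32 is a stand-in hash; it is deterministic across runs.
    return zlib.crc32(doc_id.encode("utf-8")) % slices


def partition_ids(doc_ids, slices):
    """Group ids into `slices` buckets; every id lands in exactly one."""
    buckets = [[] for _ in range(slices)]
    for doc_id in doc_ids:
        buckets[slice_for(doc_id, slices)].append(doc_id)
    return buckets


ids = ["doc-%d" % i for i in range(1000)]
buckets = partition_ids(ids, 10)
```

Because the assignment depends only on the id, independent workers can each take one slice without coordinating with the others.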

(The ideal of course would be that there's simply an option in the initiating operation, say "concurrency": 10 ... It's funny that there are dials for throttling, but the common experience seems to be that an over-exuberant reindexing process is the least of anyone's worries!)

We can dream...!

Aaron

kstaken · September 14, 2016, 6:30pm

This won't help you with the built-in API, but we built a tool called Teraslice because we needed much higher throughput for these types of operations, along with the ability to cross clusters. We recently used it to migrate 71B records from a single cluster into 8 clusters in less than 14 hours. Something like 1.2M docs/s.

Kimbro

warkolm · September 14, 2016, 10:05pm

This is a known limitation; we're working on improving it for 5.0.

aaron_ximm · September 20, 2016, 12:14am

Awesome news, thanks Mark.

Speed notwithstanding, the API has been a godsend. We had the opportunity to put the 'create' operation filter to the test when the node doing the indexing was rebooted before the process finished, and the fact that our 'resume' of reindexing required only adding a parameter to the original post was lovely.
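For anyone wanting to do the same kind of resume, a sketch of how the request body might change. The exact parameter added above isn't spelled out, so this assumes it was `op_type: create` on `dest`, and it also sets `conflicts: proceed` so that documents already copied are skipped instead of aborting the run. The index names are hypothetical.

```python
import copy


def as_resumable(body):
    """Return a copy of a reindex body that only creates missing documents."""
    resumed = copy.deepcopy(body)
    # Only index documents that do not already exist in the destination.
    resumed["dest"]["op_type"] = "create"
    # Treat "already exists" version conflicts as skips, not failures.
    resumed["conflicts"] = "proceed"
    return resumed


original = {"source": {"index": "old_index"}, "dest": {"index": "new_index"}}
resumed = as_resumable(original)
```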

We're going to experiment with an application-level parallelization this week, also; we have a motivating test case that requires some additional fields in our index which have to be fetched per document.

Best,
Aaron

nik9000 · September 20, 2016, 12:35am

Fetching things per document sounds like a thing that should be handled by an external script rather than reindex. It'll probably work though....

This thread did remind me that I needed to work on parallelizing reindex. I reached out to the author of parallel scroll and we talked through a plan. I'll put together a proposal soon-ish, though it isn't going to make it for 5.0. 5.1 if all goes well.

aaron_ximm · September 20, 2016, 5:03pm

Hi Nik,

Indeed, for this run we're going to do an application-side reindex! I did briefly consider whether there was value in first indexing the fields we need to add into a temporary secondary index, then doing document merges (with all the data then being fetchable as existing ES documents in the same cluster), but that's definitely overkill for our immediate case.
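Roughly what the application-side step might look like, as a sketch: take a page of scroll hits, merge in the extra per-document fields, and emit action/source line pairs for the bulk API. `fetch_extra` is a hypothetical callback standing in for whatever lookup supplies the additional fields, and the index name is made up.

```python
import json


def to_bulk_lines(hits, dest_index, fetch_extra):
    """Yield bulk API lines (action, then source) for each scroll hit."""
    for hit in hits:
        source = dict(hit["_source"])
        # Merge in the extra fields fetched for this document.
        source.update(fetch_extra(hit["_id"]))
        action = {"index": {"_index": dest_index,
                            "_type": hit["_type"],
                            "_id": hit["_id"]}}
        yield json.dumps(action)
        yield json.dumps(source)


hits = [{"_id": "1", "_type": "doc", "_source": {"title": "a"}}]
lines = list(to_bulk_lines(hits, "new_index", lambda doc_id: {"extra": doc_id}))
payload = "\n".join(lines) + "\n"  # bulk bodies must end with a newline
```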

Thanks for the update!
Aaron

nik9000 · October 5, 2016, 6:42pm

I did finally open this:

github.com/elastic/elasticsearch

Add automatic parallelization support to reindex and friends

elastic:master ← nik9000:reindex_auto_slice, opened 04:51PM - 05 Oct 16 UTC by nik9000 (+4250 -989)

Adds support for `?slices=N` to reindex which automatically parallelizes the process using parallel scrolls on `_uid`. Performance testing sees a 3x performance improvement for simple docs on decent hardware, maybe 30% performance improvement for more complex docs. Still compelling, especially because clusters should be able to get closer to the 3x than the 30% number. Closes #20624

Edit: this used to say the below things but I've since changed it to make it reflect what I ended up implementing:

Adds support for `?workers=N` to reindex which automatically parallelizes the process using parallel scrolls on `_uid`. Simple performance testing sees a 3x performance improvement.

At this point I targeted it at 5.1.
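Assuming the parameter ships as described in the pull request, a sliced reindex request could be put together along these lines. The base URL and index names are hypothetical, and the sketch only builds the request without sending it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_sliced_reindex(base_url, source_index, dest_index, slices):
    """Build (but do not send) a POST to _reindex with ?slices=N."""
    query = urlencode({"slices": slices})
    url = "%s/_reindex?%s" % (base_url.rstrip("/"), query)
    body = json.dumps({
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


req = build_sliced_reindex("http://localhost:9200", "old_index", "new_index", 5)
# Passing req to urllib.request.urlopen would actually start the reindex.
```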
