We want to reindex documents from our old 1.7 ES cluster into our new 7.2 ES cluster (around 2TB of data).
For indexing purposes, we are going to use the BulkProcessor API. We have three approaches in mind:
1. Have N (configurable) threads reading from non-overlapping time frames on the 1.7 cluster, and one BulkProcessor instance with setConcurrentRequests set to N.
2. Have N threads consuming from non-overlapping time frames on 1.7, and N BulkProcessors, each with setConcurrentRequests set to 1.
3. Have 1 thread consuming from 1.7, and 1 BulkProcessor with setConcurrentRequests set to N.
I'm planning to keep the rest of the BulkProcessor settings at their defaults.
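To make the setup concrete, here is a minimal sketch of how we plan to wire this up with the 7.2 high-level REST client (client construction and real error handling are omitted; `n` is the thread count mentioned above):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

// `client` is an already-built RestHighLevelClient pointed at the 7.2 cluster.
// Everything except setConcurrentRequests stays at its default.
BulkProcessor newBulkProcessor(RestHighLevelClient client, int n) {
    return BulkProcessor.builder(
            (request, bulkListener) ->
                client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
            new BulkProcessor.Listener() {
                @Override
                public void beforeBulk(long id, BulkRequest request) { }

                @Override
                public void afterBulk(long id, BulkRequest request, BulkResponse response) {
                    // inspect response.hasFailures() and log/retry as needed
                }

                @Override
                public void afterBulk(long id, BulkRequest request, Throwable failure) {
                    // the whole bulk failed, e.g. a connection error
                }
            })
        .setConcurrentRequests(n) // 0 = synchronous; n = up to n in-flight bulks
        .build();
}
```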
My question is: for approach 1, do you foresee lock contention in the BulkProcessor when multiple threads call add on the same instance? From the source code, I can see that a lock is acquired in the add call.
To mitigate the possible lock contention, we came up with approaches 2 and 3.
With 2, since there is a 1:1 mapping between producer threads and BulkProcessors, I don't see lock contention there.
With 3, since there is a single producer, I don't see contention there either.
I'd definitely experiment with the various strategies, but I'm running short on time, so a hint about the benefits/caveats of the three approaches would help me get started.
To me, the third solution looks as if it has the fewest moving parts on the implementation side. Come up with a single BulkProcessor class and go from there.
Everything else requires more synchronization on your side, and I'd always try to keep things simple.
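For what it's worth, a rough sketch of option 3 (fetchNextBatch and SourceDoc are placeholders for however you scroll the 1.7 cluster; the BulkProcessor is built as in your snippet above, with setConcurrentRequests(N)):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.xcontent.XContentType;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Single producer: one thread scrolls the 1.7 cluster and feeds one
// BulkProcessor, which itself fans out up to N concurrent bulk requests.
void reindexAll(BulkProcessor bulkProcessor) throws InterruptedException {
    List<SourceDoc> batch;
    while ((batch = fetchNextBatch()) != null) { // placeholder 1.7 scroll reader
        for (SourceDoc doc : batch) {
            bulkProcessor.add(new IndexRequest("docs-v2") // assumed target index
                .id(doc.id())
                .source(doc.json(), XContentType.JSON));
        }
    }
    // flush whatever is still buffered and wait for in-flight bulks to finish
    bulkProcessor.awaitClose(10, TimeUnit.MINUTES);
}
```

No synchronization needed on your side: add is only ever called from the one producer thread, and the processor handles the fan-out internally.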
Have you tried reindex from remote? TBH I am not sure if it works when pulling from 1.7 into 7.2, but it might be worth a try, as it would not require any programming on your side.
Unfortunately, in my test trials the reindex API turned out to be super slow because of some of its limitations, presumably the 100mb buffer limit it uses internally.
To work around that, I have to reduce the number of documents per batch, and that makes the whole process super slow.
Is that 100mb buffer limit shared, or is it per reindex task?
It looks like there were plans to make the buffer size configurable (see the "Configure buffer limit" GitHub issue), but I don't see it anywhere in the docs.
As reindexing from a remote cluster doesn't parallelize internally with slices like intra-cluster reindexing does, I've found it helpful to launch multiple reindex-from-remote tasks, each one grabbing some partition (e.g., year) of the index. You might be pleasantly surprised how well this approach scales.
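For illustration, a sketch of that partitioning with the low-level Java REST client (host names, index name, and the timestamp field are made up; note the 7.2 cluster also needs the old host listed under reindex.remote.whitelist in elasticsearch.yml):

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class PartitionedRemoteReindex {
    public static void main(String[] args) throws Exception {
        // Client pointed at the destination 7.2 cluster.
        RestClient client = RestClient.builder(
            new HttpHost("new-cluster", 9200, "http")).build();

        // One reindex-from-remote task per year; each task scrolls only its
        // own slice of the 1.7 index, so the tasks can run in parallel.
        for (int year = 2012; year <= 2019; year++) {
            Request req = new Request("POST", "/_reindex");
            req.addParameter("wait_for_completion", "false"); // returns a task id immediately
            req.setJsonEntity(
                "{"
              + "\"source\": {"
              + "  \"remote\": { \"host\": \"http://old-cluster:9200\" },"
              + "  \"index\": \"docs\","
              + "  \"size\": 1000,"
              + "  \"query\": { \"range\": { \"timestamp\": {"
              + "    \"gte\": \"" + year + "-01-01\","
              + "    \"lt\": \"" + (year + 1) + "-01-01\" } } }"
              + "},"
              + "\"dest\": { \"index\": \"docs\" }"
              + "}");
            Response resp = client.performRequest(req);
            System.out.println("started: " + EntityUtils.toString(resp.getEntity()));
        }
        client.close();
    }
}
```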
If you have experience with the reindex API, could you tell me what the two parameters in the request do?
requests_per_second and size: I hope they are what they sound like, but my experiments show otherwise.
E.g., I had an index with roughly 10000 documents and I provided a size of 1000, but the response showed that only 1 bulk request was fired. Given the size was 1000, shouldn't it have been 10 bulk requests of 1000 docs each? requests_per_second was -1 in my case (so no throttling).
I've never had to mess with requests_per_second, so I'm not really sure when it would be helpful.
Size is just how many docs get returned per iteration of the underlying scroll. I have experimented with changing this on some scripts and seen a 2x improvement with larger values, so it's worth benchmarking at 1000 and then aiming higher.
For your specific case of 10000 docs and the default size of 1000, this would result in a single scroll context that is fetched from 10 times. You do not get parallelization for free when reindexing from remote; you have to partition the work yourself, e.g., via 10 separate reindex calls.
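One more thing worth double-checking in your experiment: if I read the docs correctly, size means different things depending on where it sits in the reindex body. Inside source it is the per-batch scroll size, while at the top level it caps the total number of documents processed, so a top-level 1000 would explain seeing exactly one bulk request. A minimal sketch, reusing the low-level client Request from above (index names made up):

```java
// Batch size goes inside "source"; a TOP-LEVEL "size" would instead limit
// the total number of docs copied (1000 docs = a single default-sized bulk).
Request reindex = new Request("POST", "/_reindex");
reindex.setJsonEntity(
    "{"
  + "\"source\": { \"index\": \"docs\", \"size\": 1000 }," // docs per scroll batch
  + "\"dest\":   { \"index\": \"docs-v2\" }"
  + "}");
```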