BulkProcessor best practice for heavy ingestion

We want to reindex documents into our new 7.2 ES cluster from our old 1.7 ES cluster (around 2TB of data)

For indexing purposes, we are going to use the BulkProcessor API. We have three approaches in mind:

  1. Have N (configurable) threads reading from non-overlapping time frames in the 1.7 cluster, all feeding one BulkProcessor instance with setConcurrentRequests set to N

  2. Have N threads consuming from non-overlapping time frames in 1.7, each feeding its own BulkProcessor (N instances) with setConcurrentRequests set to 1

  3. Have 1 thread consuming from 1.7, feeding 1 BulkProcessor with setConcurrentRequests set to N

I'm planning to keep the rest of the BulkProcessor settings at their defaults.
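To make it concrete, here is roughly the skeleton I have in mind for approach 3 with the 7.2 high-level REST client. It's only a sketch: the host name, index name, value of N, and the `fetchBatchFromOldCluster` helper are placeholders, not actual code we have.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.util.Collections;
import java.util.concurrent.TimeUnit;

public class BulkReindexSketch {

    public static void main(String[] args) throws Exception {
        // Client for the destination 7.2 cluster (host name is a placeholder)
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("new-cluster", 9200, "http")));

        int n = 4; // the configurable N from the approaches above

        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                // nothing to do before a bulk goes out
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    System.err.println("Bulk " + executionId + " had failures: "
                            + response.buildFailureMessage());
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                System.err.println("Bulk " + executionId + " failed: " + failure);
            }
        };

        // One BulkProcessor with N in-flight bulk requests; all other settings stay at their defaults
        BulkProcessor bulkProcessor = BulkProcessor.builder(
                (request, bulkListener) ->
                        client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
            .setConcurrentRequests(n)
            .build();

        // Producer side: for approach 3 a single thread calls add() here
        for (String json : fetchBatchFromOldCluster()) {
            bulkProcessor.add(new IndexRequest("new-index").source(json, XContentType.JSON));
        }

        // Flush whatever is left and wait for in-flight bulks to finish
        bulkProcessor.awaitClose(10, TimeUnit.MINUTES);
        client.close();
    }

    // Placeholder for the scroll against the 1.7 cluster
    private static Iterable<String> fetchBatchFromOldCluster() {
        return Collections.emptyList();
    }
}
```

For approach 1, the only difference would be that N reader threads all call add() on this same instance; for approach 2, each thread would build its own BulkProcessor with setConcurrentRequests(1).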

My question is: for approach 1, do you see lock contention in the BulkProcessor when multiple threads call add on the same instance? From the source code, I can see that a lock is acquired in the add call.

To mitigate the possible lock contention, we came up with approaches 2 and 3.

With 2, since there is a 1:1 mapping between producer threads and BulkProcessor instances, I don't see any lock contention there.

With 3, since there is a single producer, I don't see any contention either.

I'd definitely experiment with the various strategies, but I'm running out of time, so a hint about the benefits and caveats of the three approaches would help me get started.

To me, the third solution looks as if it has the fewest moving parts on the implementation side: come up with a single BulkProcessor class and go from there.

Everything else requires more synchronization on your side, and I'd always try to keep things simple.

Have you tried reindex from remote? TBH I am not sure if it works with a 1.7 remote cluster and a 7.2 destination, but it might be worth a try, as it would not require any programming on your side.
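If you want to give it a go, it's just one request against the 7.2 cluster. A rough sketch with the low-level Java REST client (host and index names are placeholders, and the old host has to be listed under reindex.remote.whitelist on the 7.2 nodes):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ReindexFromRemoteSketch {

    public static void main(String[] args) throws Exception {
        // Client for the new 7.2 cluster; host names and index names are placeholders
        try (RestClient client = RestClient.builder(new HttpHost("new-cluster", 9200, "http")).build()) {
            Request request = new Request("POST", "/_reindex");
            // The 7.2 cluster pulls documents from the old 1.7 cluster over HTTP;
            // the old host must be allowed via reindex.remote.whitelist in elasticsearch.yml
            request.setJsonEntity(
                "{\n" +
                "  \"source\": {\n" +
                "    \"remote\": { \"host\": \"http://old-cluster:9200\" },\n" +
                "    \"index\": \"old-index\"\n" +
                "  },\n" +
                "  \"dest\": { \"index\": \"new-index\" }\n" +
                "}");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```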

The Reindex API is a lifesaver! I didn't know it would work with a 1.7 remote cluster, but it did.


Unfortunately, in my test runs the Reindex API turned out to be very slow because of some of its limitations, presumably the 100mb buffer limit it uses internally.

To work around that, I have to reduce the number of documents per batch, and that is making the whole process very slow.

Is the 100mb buffer limit a shared buffer, or is it per reindex task?
It looks like there were plans to make the buffer size configurable, but I don't see it anywhere in the docs: Configure buffer limit GitHub issue

As reindexing from a remote cluster doesn't parallelize internally with slices like intra-cluster reindexing does, I've found it helpful to launch multiple reindex-from-remote tasks, each one grabbing some partition (e.g., year) of the index. You might be pleasantly surprised how well this approach scales.
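Something along these lines, firing one task per year. The timestamp field, year range, and host names are assumptions about your data, so treat it as a sketch rather than a recipe:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class PartitionedRemoteReindexSketch {

    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("new-cluster", 9200, "http")).build()) {
            // One reindex-from-remote task per year; "timestamp" is an assumed date field
            for (int year = 2015; year <= 2019; year++) {
                Request request = new Request("POST", "/_reindex");
                // Return immediately with a task id instead of blocking until the copy finishes
                request.addParameter("wait_for_completion", "false");
                request.setJsonEntity(
                    "{\n" +
                    "  \"source\": {\n" +
                    "    \"remote\": { \"host\": \"http://old-cluster:9200\" },\n" +
                    "    \"index\": \"old-index\",\n" +
                    "    \"query\": {\n" +
                    "      \"range\": {\n" +
                    "        \"timestamp\": { \"gte\": \"" + year + "-01-01\", \"lt\": \"" + (year + 1) + "-01-01\" }\n" +
                    "      }\n" +
                    "    }\n" +
                    "  },\n" +
                    "  \"dest\": { \"index\": \"new-index\" }\n" +
                    "}");
                Response response = client.performRequest(request);
                System.out.println("year " + year + " -> " + response.getStatusLine());
            }
        }
    }
}
```

With wait_for_completion=false each call returns a task id immediately, so you can fire several partitions in parallel and keep an eye on them through the Tasks API.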

Certainly, I'm planning to do the same.

If you have experience with the Reindex API, could you tell me what the following two parameters in the request do?

requests_per_second and size. I hope they do what they sound like, but my experiments show otherwise.

For example, I had an index with roughly 10,000 documents and provided a size of 1000, but the response showed that it fired only 1 bulk request. Given the size was 1000, shouldn't it have been 10 bulk requests of 1000 each? requests_per_second was -1 in my case (so no throttling).

I've never had to mess with requests_per_second, so I'm not really sure when it would be helpful.

Size is just how many docs get returned per iteration of the underlying scroll. I have experimented with changing this in some scripts and seen a 2x improvement with larger values, so it's worth benchmarking at 1000 and then aiming higher.

For your specific case of 10,000 docs and the default size of 1000, this would result in a single scroll context that is fetched from 10 times. You do not get parallelization for free when reindexing from remote; you have to partition the work yourself via multiple reindex calls, e.g. 10 of them.
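One thing worth double-checking, though, is where the size ended up in your request. Inside source it is the per-scroll batch size I described above, while a size at the top level of the body caps the total number of documents copied, which would explain seeing just one bulk request of 1000 docs. A sketch of the batch-size variant (index and host names are placeholders; the remote block is omitted here but the same applies when it is present):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ReindexBatchSizeSketch {

    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("new-cluster", 9200, "http")).build()) {
            Request request = new Request("POST", "/_reindex");
            // "size" inside "source" = scroll batch size (how many docs per fetch);
            // a "size" at the top level of the body instead limits the total number
            // of documents copied, which is easy to mix up with the batch size
            request.setJsonEntity(
                "{\n" +
                "  \"source\": { \"index\": \"old-index\", \"size\": 1000 },\n" +
                "  \"dest\": { \"index\": \"new-index\" }\n" +
                "}");
            System.out.println(client.performRequest(request).getStatusLine());
        }
    }
}
```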
