BulkProcessor best practice for heavy ingestion

We want to reindex documents into our new 7.2 ES cluster from our old 1.7 ES cluster (around 2TB of data)

For indexing purposes, we are going to use the BulkProcessor API. We have three approaches in mind:

  1. Have N (configurable) threads reading from non-overlapping time frames in the 1.7 cluster, all feeding one BulkProcessor instance with setConcurrentRequests set to N

  2. Have N threads consuming from non-overlapping time frames in 1.7, each feeding its own BulkProcessor (N instances) with setConcurrentRequests set to 1

  3. Have 1 thread consuming from 1.7, feeding 1 BulkProcessor with setConcurrentRequests set to N

I'm planning to keep the rest of the BulkProcessor settings at their defaults.
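To make it concrete, here is roughly the skeleton I have in mind for approach 3 with the 7.2 high-level REST client. It's only a sketch: the host name, index name, value of N, and the `fetchBatchFromOldCluster` helper are placeholders, not actual code we have.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.util.Collections;
import java.util.concurrent.TimeUnit;

public class BulkReindexSketch {

    public static void main(String[] args) throws Exception {
        // Client for the destination 7.2 cluster (host name is a placeholder)
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("new-cluster", 9200, "http")));

        int n = 4; // the configurable N from the approaches above

        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                // nothing to do before a bulk goes out
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    System.err.println("Bulk " + executionId + " had failures: "
                            + response.buildFailureMessage());
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                System.err.println("Bulk " + executionId + " failed: " + failure);
            }
        };

        // One BulkProcessor with N in-flight bulk requests; all other settings stay at their defaults
        BulkProcessor bulkProcessor = BulkProcessor.builder(
                (request, bulkListener) ->
                        client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
            .setConcurrentRequests(n)
            .build();

        // Producer side: for approach 3 a single thread calls add() here
        for (String json : fetchBatchFromOldCluster()) {
            bulkProcessor.add(new IndexRequest("new-index").source(json, XContentType.JSON));
        }

        // Flush whatever is left and wait for in-flight bulks to finish
        bulkProcessor.awaitClose(10, TimeUnit.MINUTES);
        client.close();
    }

    // Placeholder for the scroll against the 1.7 cluster
    private static Iterable<String> fetchBatchFromOldCluster() {
        return Collections.emptyList();
    }
}
```

For approach 1, the only difference would be that N reader threads all call add() on this same instance; for approach 2, each thread would build its own BulkProcessor with setConcurrentRequests(1).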

My question is: for approach 1, do you see lock contention in the BulkProcessor when multiple threads call add on the same instance? From the source code, I can see that a lock is acquired in the add call.

To mitigate the possible lock contention, we came up with approaches 2 and 3.

With 2, since there is a 1:1 mapping between producer threads and BulkProcessor instances, I don't see any lock contention there.

With 3, since there is a single producer, I don't see any contention either.

I'd definitely experiment with the various strategies, but I'm running out of time, so a hint about the benefits and caveats of the three approaches would help me get started.

To me, the third solution looks as if it has the fewest moving parts on the implementation side: come up with a single BulkProcessor class and go from there.

Everything else requires more synchronization on your side, and I'd always try to keep things simple.

Have you tried reindex from remote? TBH I am not sure if it works with a 1.7 remote cluster and a 7.2 destination, but it might be worth a try, as it would not require any programming on your side.
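If you want to give it a go, it's just one request against the 7.2 cluster. A rough sketch with the low-level Java REST client (host and index names are placeholders, and the old host has to be listed under reindex.remote.whitelist on the 7.2 nodes):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ReindexFromRemoteSketch {

    public static void main(String[] args) throws Exception {
        // Client for the new 7.2 cluster; host names and index names are placeholders
        try (RestClient client = RestClient.builder(new HttpHost("new-cluster", 9200, "http")).build()) {
            Request request = new Request("POST", "/_reindex");
            // The 7.2 cluster pulls documents from the old 1.7 cluster over HTTP;
            // the old host must be allowed via reindex.remote.whitelist in elasticsearch.yml
            request.setJsonEntity(
                "{\n" +
                "  \"source\": {\n" +
                "    \"remote\": { \"host\": \"http://old-cluster:9200\" },\n" +
                "    \"index\": \"old-index\"\n" +
                "  },\n" +
                "  \"dest\": { \"index\": \"new-index\" }\n" +
                "}");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```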

The Reindex API is a lifesaver! I didn't know it would work with a 1.7 remote cluster, but it did.


Unfortunately, in my test runs the Reindex API turned out to be very slow because of some of its limitations, presumably the 100mb buffer limit it uses internally.

To work around that, I have to reduce the number of documents per batch, and that is making the whole process very slow.

Is the 100mb buffer limit a shared buffer, or is it per reindex task?
It looks like there were plans to make the buffer size configurable, but I don't see it anywhere in the docs: Configure buffer limit GitHub issue

As reindexing from a remote cluster doesn't parallelize internally with slices like intra-cluster reindexing does, I've found it helpful to launch multiple reindex-from-remote tasks, each one grabbing some partition (e.g., year) of the index. You might be pleasantly surprised how well this approach scales.
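Something along these lines, firing one task per year. The timestamp field, year range, and host names are assumptions about your data, so treat it as a sketch rather than a recipe:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class PartitionedRemoteReindexSketch {

    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("new-cluster", 9200, "http")).build()) {
            // One reindex-from-remote task per year; "timestamp" is an assumed date field
            for (int year = 2015; year <= 2019; year++) {
                Request request = new Request("POST", "/_reindex");
                // Return immediately with a task id instead of blocking until the copy finishes
                request.addParameter("wait_for_completion", "false");
                request.setJsonEntity(
                    "{\n" +
                    "  \"source\": {\n" +
                    "    \"remote\": { \"host\": \"http://old-cluster:9200\" },\n" +
                    "    \"index\": \"old-index\",\n" +
                    "    \"query\": {\n" +
                    "      \"range\": {\n" +
                    "        \"timestamp\": { \"gte\": \"" + year + "-01-01\", \"lt\": \"" + (year + 1) + "-01-01\" }\n" +
                    "      }\n" +
                    "    }\n" +
                    "  },\n" +
                    "  \"dest\": { \"index\": \"new-index\" }\n" +
                    "}");
                Response response = client.performRequest(request);
                System.out.println("year " + year + " -> " + response.getStatusLine());
            }
        }
    }
}
```

With wait_for_completion=false each call returns a task id immediately, so you can fire several partitions in parallel and keep an eye on them through the Tasks API.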

Certainly, I'm planning to do the same.

If you have experience with the Reindex API, could you tell me what the following two parameters in the request do?

requests_per_second and size. I hope they do what they sound like, but my experiments show otherwise.

For example, I had an index with roughly 10,000 documents and provided a size of 1000, but the response showed that it fired only 1 bulk request. Given the size was 1000, shouldn't it have been 10 bulk requests of 1000 each? requests_per_second was -1 in my case (so no throttling).

I've never had to mess with requests_per_second, so I'm not really sure when it would be helpful.

Size is just how many docs get returned per iteration of the underlying scroll. I have experimented with changing this in some scripts and seen a 2x improvement with larger values, so it's worth benchmarking at 1000 and then aiming higher.

For your specific case of 10,000 docs and the default size of 1000, this would result in a single scroll context that is fetched from 10 times. You do not get parallelization for free when reindexing from remote; you have to partition the work yourself via multiple reindex calls, e.g. 10 of them.
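One thing worth double-checking, though, is where the size ended up in your request. Inside source it is the per-scroll batch size I described above, while a size at the top level of the body caps the total number of documents copied, which would explain seeing just one bulk request of 1000 docs. A sketch of the batch-size variant (index and host names are placeholders; the remote block is omitted here but the same applies when it is present):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ReindexBatchSizeSketch {

    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("new-cluster", 9200, "http")).build()) {
            Request request = new Request("POST", "/_reindex");
            // "size" inside "source" = scroll batch size (how many docs per fetch);
            // a "size" at the top level of the body instead limits the total number
            // of documents copied, which is easy to mix up with the batch size
            request.setJsonEntity(
                "{\n" +
                "  \"source\": { \"index\": \"old-index\", \"size\": 1000 },\n" +
                "  \"dest\": { \"index\": \"new-index\" }\n" +
                "}");
            System.out.println(client.performRequest(request).getStatusLine());
        }
    }
}
```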
