How does batch size affect performance in bulk indexing?

Some blogs suggest tuning the batch size so that each batch takes ~1s. Why is that?

I would disregard that advice. It's a very strange criterion, and I don't really understand why someone would recommend it.

The general advice is to find the largest bulk size that maximizes throughput without negatively impacting memory usage or garbage collection.

For some clusters that may be 15mb per bulk; for others it may be 100mb. It really depends. But basing it on a latency criterion isn't a good idea.
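To make "largest bulk size that maximizes throughput without hurting GC" concrete, here is a minimal sketch of how you might pick a size from your own benchmark runs. The `Trial` type, the `pick_bulk_size` helper, the 5% tolerance, and all the numbers are hypothetical and only for illustration; you would substitute measurements from your own cluster.

```python
from typing import NamedTuple, Optional, Sequence


class Trial(NamedTuple):
    """One benchmark run at a given bulk size (all values are measured by you)."""
    bulk_mb: int          # physical size of each bulk request, in MB
    docs_per_sec: float   # observed indexing throughput
    old_gc_ms: float      # old-gen GC time observed during the run


def pick_bulk_size(
    trials: Sequence[Trial],
    max_old_gc_ms: float,
    tolerance: float = 0.05,
) -> Optional[Trial]:
    """Return the largest bulk size whose throughput is within `tolerance`
    of the best throughput, ignoring runs that exceed the GC budget."""
    acceptable = [t for t in trials if t.old_gc_ms <= max_old_gc_ms]
    if not acceptable:
        return None
    best = max(t.docs_per_sec for t in acceptable)
    near_best = [t for t in acceptable if t.docs_per_sec >= best * (1 - tolerance)]
    return max(near_best, key=lambda t: t.bulk_mb)


# Made-up numbers, purely to show the shape of the decision.
trials = [
    Trial(bulk_mb=5, docs_per_sec=8000, old_gc_ms=10),
    Trial(bulk_mb=15, docs_per_sec=12000, old_gc_ms=20),
    Trial(bulk_mb=30, docs_per_sec=11800, old_gc_ms=40),
    Trial(bulk_mb=60, docs_per_sec=4000, old_gc_ms=2000),
]
choice = pick_bulk_size(trials, max_old_gc_ms=500)
print(choice.bulk_mb)
```

Note that nothing in this selection looks at how long a single bulk takes; only throughput and GC behavior enter into it.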


In our case, when we go from 15k to 25k docs per bulk, throughput decreases by 3x, CPU load (4 cores) goes from 360 to 270, and old-gen GC time goes up by 100x. We thought that increasing the index_buffer_size from 10% to 30% would resolve that, but it didn't. Any idea what is causing this behavior? We're trying to understand how ES works under the hood more than anything.

Yep, it's the old-gen GC that's killing your performance. The bulk needs to sit in memory while it is being split/sent to the various nodes and shards. If you size the bulk too big, it just plops into memory and fills the newgen, which tenures a bunch of stuff prematurely to old-gen, which can trigger old-gen GCs.

You'll notice that I specified _physical_ sizes, not number of documents. Going by number of docs is very unreliable... 100 five-byte documents are very different from 100 ten-megabyte documents! You should really be batching based on physical size to find the optimum size.
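As a rough illustration of batching by physical size rather than by doc count, here is a minimal client-side sketch. `batch_by_bytes` is a hypothetical helper, not part of any Elasticsearch client, and it assumes the docs are JSON-serializable dicts. It only counts the serialized document source, so the real request will be somewhat larger once the bulk action lines are added.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def batch_by_bytes(
    docs: Iterable[Dict[str, Any]], max_bytes: int
) -> Iterator[List[Dict[str, Any]]]:
    """Group docs into batches whose serialized size stays under max_bytes.

    A single doc larger than max_bytes still gets a batch of its own.
    """
    batch: List[Dict[str, Any]] = []
    size = 0
    for doc in docs:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        # Close the current batch if adding this doc would exceed the budget.
        if batch and size + doc_bytes > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch


# Five small docs of 19 serialized bytes each, with a 40-byte budget.
docs = [{"f": "x" * 10} for _ in range(5)]
batch_lengths = [len(b) for b in batch_by_bytes(docs, max_bytes=40)]
print(batch_lengths)
```

With something like this, the doc count per bulk varies with document size while the payload stays roughly constant, which is the number you actually want to tune.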

Increasing the index_buffer_size won't help, as this pressure is coming from the bulk object itself (not from the Lucene indexing process).

Edit: Note, if you have parallel bulk threads running, you need to take that into account too. One bulk of 100mb is different from ten bulks of 100mb :)
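The arithmetic behind that last point, as a tiny sketch. `in_flight_bulk_bytes` is a hypothetical helper, and the result only counts raw request bytes; it assumes all bulks are in flight at the same time and says nothing about how much heap the cluster needs on top of that.

```python
MB = 1024 * 1024


def in_flight_bulk_bytes(bulk_bytes: int, concurrent_bulks: int) -> int:
    """Raw request bytes held in memory if all bulks are in flight at once."""
    return bulk_bytes * concurrent_bulks


one_bulk = in_flight_bulk_bytes(100 * MB, concurrent_bulks=1)
ten_bulks = in_flight_bulk_bytes(100 * MB, concurrent_bulks=10)
print(one_bulk // MB, ten_bulks // MB)  # 100 1000
```

So when you benchmark bulk sizes, keep the number of concurrent bulk threads the same as in production, or the size you settle on won't transfer.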
