How does batch size affect performance in bulk indexing?

Some blogs suggest tuning the batch size so that each batch takes ~1s. Why is that?

I would disregard that advice... it's a very strange criterion and I don't really understand why someone would recommend it.

The general advice is to find the largest bulk size that maximizes throughput, without negatively impacting memory usage / garbage collections.

For some clusters that may be 15 MB per bulk; for others it may be 100 MB. It really depends. But basing it on a latency criterion isn't a good idea.
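If it helps, here's a rough, untested sketch of how you might sweep a few physical bulk sizes and compare throughput while watching heap/GC on the node. It uses the Python client's bulk helper; the index name, document generator, and sizes are just placeholders, and you should double-check the parameter names against your client version:

```python
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # adjust for your cluster

def actions(n=100_000):
    """Placeholder document source -- swap in your real data.
    ~1 KB per doc here, purely illustrative."""
    for i in range(n):
        yield {"_index": "bulk-bench", "_source": {"id": i, "payload": "x" * 1024}}

# Sweep a few candidate physical bulk sizes; watch heap usage and GC stats
# on the cluster while each run is in flight.
for mb in (5, 15, 50, 100):
    start = time.monotonic()
    # chunk_size is set high so max_chunk_bytes becomes the effective limit
    ok, _ = helpers.bulk(es, actions(), chunk_size=100_000,
                         max_chunk_bytes=mb * 1024 * 1024)
    print(f"{mb} MB bulks: {ok / (time.monotonic() - start):.0f} docs/sec")
```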

Thanks!

In our case, when we go from 15k to 25k docs per bulk, throughput decreases by 3x, CPU load (4 cores) drops from 360 to 270, and old-gen GC times go up by 100x. We thought that increasing the index_buffer_size from 10% to 30% would resolve that, but it didn't. Any idea what is causing this behavior? Trying to understand how ES works under the hood more than anything.

Yep, it's the old-gen GC that's killing your performance. The bulk needs to sit in memory while it is being split/sent to the various nodes and shards. If you size the bulk too big, it just plops into memory and fills the new gen, which prematurely tenures a bunch of objects to old gen, which can trigger old-gen GCs.

You'll notice that I specified _physical_ sizes, not number of documents. Going by number of docs is very unreliable... 100 five-byte documents are very different from 100 ten-megabyte documents! You should really be batching based on physical size to find the optimum.
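As a sketch, batching by accumulated payload size instead of document count looks roughly like this (plain Python; `send_bulk` is a stand-in for whatever client call you actually use, and the 15 MB default is just an example, not a recommendation):

```python
def batches_by_size(docs, max_bytes=15 * 1024 * 1024):
    """Group already-serialized documents into batches of at most max_bytes."""
    batch, batch_bytes = [], 0
    for doc in docs:                      # doc is a serialized JSON byte string
        if batch and batch_bytes + len(doc) > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += len(doc)
    if batch:
        yield batch

# Usage: send each size-bounded batch with whatever client you use.
# for batch in batches_by_size(documents()):
#     send_bulk(batch)
```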

Increasing the index_buffer_size won't help, as this pressure is coming from the bulk object itself (not from the Lucene indexing process).

Edit: Note that if you have parallel bulk threads running, you need to take that into account too. One bulk of 100 MB is very different from ten concurrent bulks of 100 MB each :slight_smile:
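To make that concrete, the coordinating node holds every in-flight bulk in heap at the same time, so the number that matters is roughly bulk size × number of concurrent bulk threads (the figures below are made up, just to show the arithmetic):

```python
# Back-of-the-envelope for aggregate in-flight bulk memory (illustrative numbers).
bulk_size_mb = 100
concurrent_bulk_threads = 10
print(f"~{bulk_size_mb * concurrent_bulk_threads} MB of bulk payloads in heap at once")  # ~1000 MB
```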
