We're currently having some indexing performance issues with one of our applications using Elasticsearch 1.7.
We're writing to one index (16 shards, 21 data nodes) with 10 concurrent threads using the Java API. Bulk sizes depend on the amount of incoming data, and since the incoming data varies in size, so do the bulk sizes.
As long as the bulk sizes meet our preferred maximum of 8000 documents, performance is great for all threads. However, as soon as the incoming data load decreases for some threads and their bulk sizes drop towards 1000 documents or fewer, import performance degrades for all threads, even for the threads that still have bulk sizes of 8000.
Are there any configuration parameters one should consider in this scenario with varying bulk sizes to maximise performance, or do we need to even out the bulk sizes to get a stable indexing rate?
We managed to speed things up by sorting the incoming data first, making the bulks more even. That has increased the import speed, but it's not stable: it toggles between very fast (200k-300k docs/sec) and very slow (1k-10k docs/sec) within short periods of time (minutes).
When it's slow we can see that some batches get "stuck" in the "bulk.active" queue for a data node. It's a different data node pretty much every time, so it doesn't seem to be related to a single data node or machine. We can't see that our system resources are saturated in any way during this, and there seems to be no throttling going on (at least according to the logs).
Do you supply doc IDs or just let Elasticsearch generate them? If you supply the IDs there is extra work in checking whether the doc already exists - it's fast, but it essentially introduces a read step to every write.
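For illustration, a minimal sketch of the two variants with the Java API (index and type names are placeholders):

```java
import org.elasticsearch.client.Client;

public class IdExamples {

    // Caller-supplied ID: Elasticsearch has to check whether a document with
    // this ID already exists, which adds an ID lookup (a read) to every write.
    static void indexWithExplicitId(Client client, String id, String json) {
        client.prepareIndex("myindex", "mytype", id)
              .setSource(json)
              .execute().actionGet();
    }

    // Auto-generated ID: no existence check is needed, so this path is cheaper.
    static void indexWithAutoId(Client client, String json) {
        client.prepareIndex("myindex", "mytype")
              .setSource(json)
              .execute().actionGet();
    }
}
```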
That's pretty normal. If you look at the volume throughput (megabytes per second) you should see a constant rate. Docs per second are pretty meaningless when doc size varies a lot.
Bulk requests can be sized by volume to send uniformly sized requests, so I suggest examining the bulk API.
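For example, a minimal sketch using BulkProcessor from the Java API, which flushes by document count, by request volume, or after a time interval; the thresholds here are placeholders (8000 matches the max bulk size mentioned above):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

public class BulkBySize {

    static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) { }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                // Check response.hasFailures() here.
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                // Log and/or retry the failed bulk here.
            }
        })
        .setBulkActions(8000)                                // flush at 8000 docs at most
        .setBulkSize(new ByteSizeValue(10, ByteSizeUnit.MB)) // ... or at ~10 MB of request data
        .setFlushInterval(TimeValue.timeValueSeconds(5))     // ... or after 5 seconds without a flush
        .setConcurrentRequests(1)                            // one bulk in flight while the next is built
        .build();
    }
}
```

The indexing threads then add documents with bulkProcessor.add(...), and whichever threshold is hit first triggers the flush, so request sizes stay roughly uniform even when the incoming data rate varies.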
We do. We generate a doc ID from the file name (with path) together with the row number of the row in that file the document was taken from. The row number is zero-padded. As you can imagine we're getting quite long IDs. We can see a lot of disk reads indeed, but not enough to be a problem.
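For illustration, the IDs look roughly like this (the exact separator and padding width differ):

```java
public class DocIds {

    // Sketch of the ID scheme described above: full file path plus a
    // zero-padded row number. Separator and padding width are made up.
    static String docId(String filePath, long rowNumber) {
        return filePath + ":" + String.format("%010d", rowNumber);
    }

    public static void main(String[] args) {
        // e.g. "/data/incoming/2015/09/part-0042.csv:0000001234"
        System.out.println(docId("/data/incoming/2015/09/part-0042.csv", 1234));
    }
}
```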
We're using spinning disks and have set the merge thread count to 1 according to the recommendations. However, we've got a lot of them and they're pretty much the fastest ones you can come by. The rest of the hardware is also quite beefy: the 7 machines hosting 3 data nodes each have 24-core CPUs. Are we still obligated to have just 2+1 merge threads?
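A sketch of where that setting can be applied, assuming it refers to index.merge.scheduler.max_thread_count (the usual spinning-disk recommendation), here set at index creation time via the Java API; index name and shard count are placeholders:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class CreateIndexWithMergeSettings {

    static void createIndex(Client client) {
        client.admin().indices().prepareCreate("myindex")
              .setSettings(ImmutableSettings.settingsBuilder()
                      .put("index.number_of_shards", 16)
                      // The spinning-disk recommendation referenced above.
                      .put("index.merge.scheduler.max_thread_count", 1))
              .execute().actionGet();
    }
}
```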