Bad performance with varying bulk size

Hi!

We're currently having some indexing performance issues with one of our applications using Elasticsearch 1.7.

We're writing to one index (16 shards, 21 data nodes) with 10 concurrent threads using the Java API. Bulk sizes depend on the amount of incoming data, and since the incoming data varies in size, so do the bulk sizes.
As long as the bulk sizes meet our preferred maximum of 8000 documents, performance is great for all threads. However, as soon as the incoming data load decreases for some threads and their bulk sizes drop towards 1000 documents or less, import performance degrades for all threads, even for the threads that still send bulks of 8000.
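
To illustrate, each thread batches roughly like this (a simplified sketch, not our real code; the index/type names and the documents iterable are placeholders):

```java
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

// Simplified sketch of one indexing thread. "myindex"/"mytype" and the
// documents iterable are placeholders.
void indexDocuments(Client client, Iterable<String> documents) {
    BulkRequestBuilder bulk = client.prepareBulk();
    for (String json : documents) {
        bulk.add(client.prepareIndex("myindex", "mytype").setSource(json));
        // Flush once we hit our preferred maximum of 8000 documents; when the
        // incoming stream thins out, the bulks end up much smaller than this.
        if (bulk.numberOfActions() >= 8000) {
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                // log/handle failed items here
            }
            bulk = client.prepareBulk();
        }
    }
    if (bulk.numberOfActions() > 0) {
        bulk.execute().actionGet(); // send whatever is left, however small
    }
}
```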

Are there any configuration parameters we should look at for this scenario with varying bulk sizes to maximise performance, or do we need to even out the bulk sizes to get a stable indexing rate?

/Alex

Some more information:

  • Heap for our data nodes is 32GB
  • Our indices are immutable, so we toggle the refresh interval: -1 before indexing starts and 1 when indexing completes (see the sketch after this list).
  • Replicas are 0 until we're done writing to an index.
  • We're using spinning disks so we have set index.merge.scheduler.max_thread_count to 1.
  • index.translog.flush_threshold_size is set to 1000mb
  • indices.memory.index_buffer_size: 50%
  • indices.store.throttle.type: none (despite this, we still see some throttling every now and then)
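
The refresh/replica toggling is done through the update settings API, roughly like this (a sketch; "myindex" and the restored values are placeholders for our real ones):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

// Sketch of the settings we toggle around an import run ("myindex" is a placeholder).
void beforeImport(Client client) {
    client.admin().indices().prepareUpdateSettings("myindex")
            .setSettings(ImmutableSettings.settingsBuilder()
                    .put("index.refresh_interval", "-1")   // no refreshes while writing
                    .put("index.number_of_replicas", 0)    // replicate after the import
                    .build())
            .execute().actionGet();
}

void afterImport(Client client) {
    client.admin().indices().prepareUpdateSettings("myindex")
            .setSettings(ImmutableSettings.settingsBuilder()
                    .put("index.refresh_interval", "1s")   // back to (roughly) the default
                    .put("index.number_of_replicas", 1)    // example value, restore what you run with
                    .build())
            .execute().actionGet();
}
```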

We managed to speed things up by sorting the incoming data first, making the bulks more even. That has increased the import speed, but it's not stable: it toggles between very fast (200k-300k docs/sec) and very slow (1k-10k docs/sec) within minutes.

When it's slow we can see that some batches get "stuck" in the "bulk.active" queue on a data node. It's a different data node pretty much every time, so it doesn't seem to be related to a single data node or machine. We can't see that our system resources are saturated in any way while this happens, and there seems to be no throttling going on (at least according to the logs).

Do you supply doc IDs or just let Elasticsearch generate new ones? If you supply the IDs there is extra work in checking whether the doc already exists - it's fast, but it essentially introduces a read step to every write.
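
To make the difference concrete, it comes down to whether an ID is set on each index request (a quick sketch; the names and the example ID are placeholders):

```java
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Client;

// Sketch: auto-generated vs. supplied IDs ("myindex"/"mytype" and the ID are placeholders).
void addBoth(Client client, BulkRequestBuilder bulk, String json) {
    // Auto-generated ID: a pure append, no existence check required.
    bulk.add(client.prepareIndex("myindex", "mytype").setSource(json));

    // Supplied ID: Elasticsearch has to check whether that ID already exists,
    // which effectively adds a read to every write.
    bulk.add(client.prepareIndex("myindex", "mytype", "some-supplied-id").setSource(json));
}
```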

That's pretty normal. If you look at the volume throughput (megabytes per second) you should see a constant rate; docs per second is pretty meaningless when doc size varies a lot.

Bulk requests can be sized by volume to send uniformly sized requests, so I suggest examining the bulk API.
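
For example, the Java API's BulkProcessor can flush by size in bytes instead of by document count; a minimal sketch (the listener is left mostly empty, and the 10 MB / 5 concurrent requests are just example values to tune):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

// Sketch: flush bulks by volume rather than by document count.
BulkProcessor buildProcessor(Client client) {
    return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                @Override
                public void beforeBulk(long executionId, BulkRequest request) {}

                @Override
                public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                    // check response.hasFailures() here
                }

                @Override
                public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                    // handle transport-level failures here
                }
            })
            .setBulkActions(-1)                                  // disable flushing by doc count
            .setBulkSize(new ByteSizeValue(10, ByteSizeUnit.MB)) // flush by volume instead
            .setConcurrentRequests(5)                            // example concurrency
            .build();
}
```

Each thread then just adds its index requests to the processor and it takes care of cutting evenly sized bulks.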

We do. We generate a doc ID from the file name (with path) together with the row number from the file the document was taken from. The row number is zero-padded, so as you can imagine we're getting quite long IDs. We can indeed see a lot of disk reads, but not enough to be a problem.
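
For illustration, the IDs end up looking roughly like this (the separator and padding width here are examples, not the exact production format):

```java
// Sketch of how our document IDs are built; separator and padding width are illustrative.
String documentId(String filePath, long rowNumber) {
    // e.g. "/data/2015/08/part-00042.csv:0000001337" - quite long
    return filePath + ":" + String.format("%010d", rowNumber);
}
```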

We're using the BulkRequestBuilder right now but have been considering the BulkProcessor.

Regarding "index.merge.scheduler.max_thread_count".

We're using spinning disks and have set the count to 1 according to the recommendations. However, we have a lot of them and they're pretty much the fastest ones you can come by, and the rest of the hardware is also quite beefy: the 7 machines hosting 3 data nodes each have 24-core CPUs. Are we still obliged to stick to just 2+1 merge threads?