We're currently having some indexing performance issues with one of our applications using Elasticsearch 1.7.
We're writing to one index (16 shards, 21 data nodes) with 10 concurrent threads using the Java API. Bulk sizes depend on the amount of incoming data, and since the incoming data varies in size, so do the bulk sizes.
As long as the bulk sizes meet our preferred maximum of 8000 documents, performance is great for all threads. However, as soon as the incoming data load decreases for some threads and their bulk sizes drop towards 1000 documents or fewer, import performance degrades for all threads, even for the threads that still have bulk sizes of 8000.
Are there any configuration parameters one should consider in this scenario with varying bulk sizes to maximise performance, or do we need to even out the bulk sizes to get a stable indexing rate?
We managed to speed things up by sorting the incoming data first, making the bulks more even. That has increased the import speed, but it's not stable: it toggles between very fast (200k-300k docs/sec) and very slow (1k-10k docs/sec) within short periods of time (minutes).
When it's slow we can see that some batches get "stuck" in the "bulk.active" queue for a data node. It's a different data node pretty much every time, so it doesn't seem to be related to a single data node or machine. We can't see that our system resources are saturated in any way during this, and there seems to be no throttling going on (at least according to the logs).
Do you supply doc IDs or just let Elasticsearch generate them? If you supply the IDs there is extra work in checking whether the doc already exists - it's fast, but it essentially introduces a read step to every write.
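For illustration, a minimal sketch of the two variants with the Java API (index and type names are placeholders):

```java
import org.elasticsearch.client.Client;

public class IdExamples {

    // Caller-supplied ID: Elasticsearch has to check whether a document with
    // this ID already exists, which adds an ID lookup (a read) to every write.
    static void indexWithExplicitId(Client client, String id, String json) {
        client.prepareIndex("myindex", "mytype", id)
              .setSource(json)
              .execute().actionGet();
    }

    // Auto-generated ID: no existence check is needed, so this path is cheaper.
    static void indexWithAutoId(Client client, String json) {
        client.prepareIndex("myindex", "mytype")
              .setSource(json)
              .execute().actionGet();
    }
}
```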
That's pretty normal. If you look at the volume throughput (megabytes per second) you should see a constant rate. Docs per second are pretty meaningless when doc size varies a lot.
Bulk requests can be sized by volume to send uniformly sized requests, so I suggest examining the bulk API.
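For example, a minimal sketch using BulkProcessor from the Java API, which flushes by document count, by request volume, or after a time interval; the thresholds here are placeholders (8000 matches the max bulk size mentioned above):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

public class BulkBySize {

    static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) { }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                // Check response.hasFailures() here.
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                // Log and/or retry the failed bulk here.
            }
        })
        .setBulkActions(8000)                                // flush at 8000 docs at most
        .setBulkSize(new ByteSizeValue(10, ByteSizeUnit.MB)) // ... or at ~10 MB of request data
        .setFlushInterval(TimeValue.timeValueSeconds(5))     // ... or after 5 seconds without a flush
        .setConcurrentRequests(1)                            // one bulk in flight while the next is built
        .build();
    }
}
```

The indexing threads then add documents with bulkProcessor.add(...), and whichever threshold is hit first triggers the flush, so request sizes stay roughly uniform even when the incoming data rate varies.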
We do. We generate a doc ID from the file name (with path) together with the row number of the row in that file the document was taken from. The row number is zero-padded. As you can imagine we're getting quite long IDs. We can see a lot of disk reads indeed, but not enough to be a problem.
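For illustration, the IDs look roughly like this (the exact separator and padding width differ):

```java
public class DocIds {

    // Sketch of the ID scheme described above: full file path plus a
    // zero-padded row number. Separator and padding width are made up.
    static String docId(String filePath, long rowNumber) {
        return filePath + ":" + String.format("%010d", rowNumber);
    }

    public static void main(String[] args) {
        // e.g. "/data/incoming/2015/09/part-0042.csv:0000001234"
        System.out.println(docId("/data/incoming/2015/09/part-0042.csv", 1234));
    }
}
```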
We're using spinning disks and have set the merge thread count to 1 according to the recommendations. However, we've got a lot of them and they're pretty much the fastest ones you can come by. The rest of the hardware is also quite beefy: the 7 machines hosting 3 data nodes each have 24-core CPUs. Are we still obligated to have just 2+1 merge threads?
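A sketch of where that setting can be applied, assuming it refers to index.merge.scheduler.max_thread_count (the usual spinning-disk recommendation), here set at index creation time via the Java API; index name and shard count are placeholders:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class CreateIndexWithMergeSettings {

    static void createIndex(Client client) {
        client.admin().indices().prepareCreate("myindex")
              .setSettings(ImmutableSettings.settingsBuilder()
                      .put("index.number_of_shards", 16)
                      // The spinning-disk recommendation referenced above.
                      .put("index.merge.scheduler.max_thread_count", 1))
              .execute().actionGet();
    }
}
```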