How to bump the Elasticsearch 'processors' setting in order to increase thread_pool.bulk.size?

Christian & Mark -
thanks very much for your advice and comments!

Responding (partially) to Christian's latest comments:

#1)
I will act on your advice regarding how high-cardinality fields could skew the benchmark results away
from what would be observed in the actual production system. Makes perfect sense. So
I will modify my Elasticsearch REST adapter for ndbench to generate more realistic data (real words, not random gibberish).
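For anyone curious, this is roughly the shape of the change to the generator (a quick Python sketch; the word list
and field names are placeholders for illustration, not what our adapter actually uses):

    import random

    # Placeholder word pool; in practice we would load a real dictionary file
    # so field values look like real text instead of random gibberish.
    WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

    def make_doc(num_terms=20):
        # Build a document whose text field contains real words, keeping term
        # cardinality closer to what the production system actually sees.
        return {
            "message": " ".join(random.choice(WORDS) for _ in range(num_terms)),
            "severity": random.choice(["INFO", "WARN", "ERROR"]),
        }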

#2)
I looked into that and got similar results even when I let Elasticsearch generate the IDs instead of supplying
them as part of the bulk indexing request.
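For context, the only change on our side was dropping _id from the bulk action lines. A rough Python sketch of what
that looks like against the plain REST bulk endpoint (the index name and type are placeholders, and newer versions
don't need the type at all):

    import json
    import requests

    def bulk_index(docs, es_url="http://localhost:9200", index="benchmark"):
        # Leaving "_id" out of the action line makes Elasticsearch
        # auto-generate document IDs instead of us supplying them.
        lines = []
        for doc in docs:
            lines.append(json.dumps({"index": {"_index": index, "_type": "doc"}}))
            lines.append(json.dumps(doc))
        body = "\n".join(lines) + "\n"
        return requests.post(
            es_url + "/_bulk",
            data=body,
            headers={"Content-Type": "application/x-ndjson"},
        )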

We are seeing a dramatic rise in reads as the data grows. I am guessing a key factor is segment merges, and I am going
to look at tuning the segment merge phase (if you have a recommendation as to which merge strategy is best
for a use case that generates 60TB per index per day, please let us know!)
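What I plan to experiment with first are the dynamic settings of the tiered merge policy, roughly like this (the
values are only starting points for experiments, not something I would recommend yet):

    import requests

    def tune_merge_settings(es_url="http://localhost:9200", index="benchmark"):
        # Dynamic index settings for the tiered merge policy; the values are
        # guesses to start experimenting from, not a known-good recommendation.
        settings = {
            "index.merge.policy.max_merged_segment": "10gb",
            "index.merge.policy.segments_per_tier": 10,
            "index.merge.scheduler.max_thread_count": 1,
        }
        return requests.put(es_url + "/" + index + "/_settings", json=settings)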

(graph of disk reads/writes is attached)

#3)
Is the 60TB of data spread across multiple indices or is it just the one currently being written to?

    just one

looks like you may be having very large shards,

Yes... we tried 360 shards and 120. 120 gave 4% better indexing throughput, and the resulting shard size after 60TB of primary
index data is 500GB.

At this scale you may want to look into using the rollover API.

Yes. We will definitely look at breaking up our monster index into maybe 24 parts by hour... we need to make sure
we can still use Kibana against it and get it to work as it does now.
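Roughly what I have in mind, assuming we put a write alias in front of the index (the alias name and thresholds
below are made up for illustration):

    import requests

    def rollover_if_needed(es_url="http://localhost:9200", alias="logs-write"):
        # Ask Elasticsearch to roll the write alias over to a fresh index once
        # the current one is an hour old or has passed a doc-count threshold.
        body = {
            "conditions": {
                "max_age": "1h",
                "max_docs": 1000000000,
            }
        }
        return requests.post(es_url + "/" + alias + "/_rollover", json=body)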

Do you get the same drop in indexing speed if you let Elasticsearch assign the document ids?

Surprisingly, this made no difference for us.

#4)
Our production system actually uses 1 replica, not 2. I dropped replicas to 1 in the benchmark, but it did not
seem to make much difference in the total number of documents indexed per second per node. So, yes indeed... our
experience is exactly as you say: "Primary and replica shards basically do the same work for indexing".
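For completeness, dropping replicas in the benchmark was just a dynamic settings update, along these lines (the
index name is a placeholder):

    import requests

    def set_replica_count(es_url="http://localhost:9200", index="benchmark", replicas=1):
        # number_of_replicas is a dynamic setting, so it can be changed on a
        # live index without reindexing.
        return requests.put(
            es_url + "/" + index + "/_settings",
            json={"index": {"number_of_replicas": replicas}},
        )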

I will reply to the other points tomorrow.

thnx
-chris
