How to bump the Elasticsearch 'processors' setting in order to increase thread_pool.bulk.size?

We've done pretty extensive performance analysis and found that our write-heavy, Elasticsearch
5.5.1-based application is under-utilizing CPU resources (load average is about the same as the number of cores,
and we are 25% idle across all nodes). Since we are not bottlenecked on I/O or network,
we decided to increase the number of threads that the hot_threads API flagged as the
busiest: namely, the ones in the bulk indexing thread pool.
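
For reference, this is roughly the check we ran to find the busy pool (a sketch; host, port, and thread count are placeholders):

    # Sample the hottest threads on every node; in our case the busiest ones were
    # consistently in the bulk indexing thread pool.
    curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5'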

However, the only way to increase the size of this pool is to increase the
number of processors, which seems to be a rather tricky task,
with not much associated documentation (that I could find, at least).

The ES documentation says:

the maximum size of the bulk thread pool is 1 + number-of-available-processors (which is
auto-detected and can be overridden.)

However, the docs say that overriding the processors setting is

"an expert-level use-case and there’s a lot more involved than just
setting the processors setting as there are other considerations like
changing the number of garbage collector threads, pinning processes to
cores, etc."

I am hoping there is a nice guide or tutorial that will walk me through all the nuances and gotchas of
changing the 'processors' setting. I did some searching, but nothing jumped out at me.
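
For concreteness, this is the kind of static, per-node change I understand would be needed (a sketch only; the config path and the values are my assumptions for our 8-core nodes):

    # Sketch: add the expert-level overrides to each node's elasticsearch.yml
    # (path varies by distribution) and restart the node.
    cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
    processors: 10
    thread_pool.bulk.size: 10
    EOF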

Thanks in advance!

<< also asked on Stack Overflow... if someone gives me an answer here I will definitely propagate that answer to Stack Overflow for the benefit of people who just read that >>

How many CPU cores/threads do you have per host and what are you thinking of increasing the applicable Elasticsearch setting to?

Hi. We are using Amazon AMIs in the AWS cloud. The instance type is i2.2xlarge, which
has 8 processors
[via 'nproc' and also: cat /proc/cpuinfo | grep processor | wc -l].

I want to increase the bulk indexing thread pool size by two, so I am now getting ready to
run a test on a cluster in which all nodes have processors=10 (not 8) and the
bulk indexing thread pool size also set to 10.

I'm crossing my fingers. But since there were those warnings related to
the processors setting that I don't know how to deal with, I'm a bit
nervous as to the result ;^)
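
In case it's useful, this is how I plan to confirm the new pool size actually took effect after the restart (a sketch; host and port are placeholders):

    # Show the configured size and live activity of the bulk thread pool on every node
    curl -s 'http://localhost:9200/_cat/thread_pool/bulk?v&h=node_name,name,size,queue_size,active,rejected'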

How are you indexing into the cluster? What indexing throughput are you seeing? What is the size and complexity of your documents? What is your bulk size and how many concurrent indexing threads/processes are you using? How many shards are you actively indexing into?

Thanks for your reply. Answers below...

HOW ARE YOU INDEXING INTO THE CLUSTER?

    using this open source benchmarking utility:
        https://github.com/Netflix/ndbench
        with a REST API adapter that I wrote for my current client.

    30 instances of NDbench clients using the bulk API w/ 2000 docs per request
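
For completeness, the requests the adapter sends are shaped roughly like this (a sketch; the host, the 'doc' type, the IDs, and the field values are illustrative, and a real request carries 2000 documents):

    # Two-document illustration of the newline-delimited body sent to the bulk API
    cat > bulk.ndjson <<'EOF'
    {"index":{"_id":"doc-000001"}}
    {"keyword1":"hello","long1":1,"text1":"hello","isBulkWrite":"true"}
    {"index":{"_id":"doc-000002"}}
    {"keyword1":"hello","long1":2,"text1":"hello","isBulkWrite":"true"}
    EOF
    curl -s -H 'Content-Type: application/x-ndjson' \
         -XPOST 'http://localhost:9200/ndbench_index/doc/_bulk' --data-binary @bulk.ndjson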

CLUSTER:

120 data nodes
    in 1 AWS region  (40 nodes per availability zone)
3 dedicated masters  (each in a different AZ)

data node instance type is i2.2xlarge, which has 8 processors.
master node instance type is r3.xlarge.

WHAT IS THE SIZE AND COMPLEXITY OF YOUR DOCUMENTS?

documents have the structure shown below in SAMPLE DOC
    string contents are randomized
    IDs are provided for each document
        [we found that NOT supplying the ID did not lead to throughput gains - surprising!]

WHAT IS YOUR BULK SIZE

    2K (2,000 documents per bulk request)

HOW MANY CONCURRENT INDEXING THREADS/PROCESSES ARE YOU USING?

bulk threads      10
indexing threads  10

    Shortly, I am planning to dial everything but the bulk thread pool
    size back down to what it was before I bumped up 'processors', i.e. back to 8.

HOW MANY SHARDS ARE YOU ACTIVELY INDEXING INTO?
360

3 replicas

INDEXING THROUGHPUT

Initially about 7K docs indexed per second per node.

After 60GB of data indexed (just considering primary documents, not
replicas), throughput falls to about 4.5K docs indexed per second per node.

SAMPLE DOC
(..except each string is padded so that the entire payload comes to 2,300 bytes)

    {
      "object0" : {"text0" : "hello", "long0" : 1, "keyword0" : "hello"},
      "keyword3" : "hello",
      "keywords" : "hello",
      "keyword1" : "hello",
      "keyword2" : "hello",
      "nested0" : {"text0" : "hello", "long0" : 1, "keyword0" : "hello"},
      "date2" : 1504081148367,
      "date1" : "Aug 30, 2017 1:19:08 AM",
      "long2" : 1,
      "long3" : 1,
      "long1" : 1,
      "isBulkWrite" : "true",
      "text1" : "hello",
      "text2" : "hello",
      "long4" : 1
    }

I have a few additional questions around the information you provided:

  1. How many connections do your NDbench clients use concurrently to index into the cluster? Are connections distributed evenly across nodes in the cluster?
  2. How many indices are you indexing into? How many primary and replica shards for each? How many indices do you expect to index into in production?
  3. You mentioned that indexing throughput slows down after 60GB has been indexed. Is that in total, per node, per index or per shard?
  4. Do the indexing rates mentioned per node include indexing into replica shards or is it just documents indexed into primary shards? How is this measured?
  5. Are you using nested documents? If so, how many nested elements are there on average per document?
  6. Why are you padding data up to a specific length?

HOW MANY CONNECTIONS DO YOUR NdBENCH CLIENTS USE CONCURRENTLY TO INDEX INTO THE CLUSTER?

each of the 30 NDBench clients has 50 writer threads.
    On start-up they attempt requests to the target cluster at a rate of 20 requests per second,
    ramping up to 40 requests per second over 3 minutes as long as the error rate is below 10%.
    But before I even get to 20 requests per second I start getting timeouts on almost all the ndbench clients,
    so the effective request rate is much lower than 20 per second per client.

        Currently I am ignoring these timeouts. I am simulating a high load where such timeouts could happen
        in production.

        But if this could be causing a performance issue on the
        server, I would go back and try to make my benchmark client back off a little more so as to minimize
        the occurrence of timeouts.

ARE CONNECTIONS DISTRIBUTED EVENLY ACROSS NODES IN THE CLUSTER?

Yes, as described above. Each client behaves in the same way.

HOW MANY INDICES ARE YOU INDEXING INTO?

1

HOW MANY PRIMARY AND REPLICA SHARDS FOR EACH?

360 primary shards, 2 replicas

HOW MANY INDICES DO YOU EXPECT TO INDEX INTO IN PRODUCTION?

1 index being actively written to for the current day's log data.

4 previous days' indices.

All will have light search activity, with the main activity centered on the most recent index.

YOU MENTIONED THAT INDEXING THROUGHPUT SLOWS DOWN AFTER 60gb HAS BEEN INDEXED.
IS THAT IN TOTAL, PER NODE, PER INDEX OR PER SHARD?

My bad. I should have written that the slowdown occurs after 60 TERABYTES of data are written
to the primary shards (across the whole cluster).

DO THE INDEXING RATES MENTIONED PER NODE INCLUDE INDEXING INTO REPLICA SHARDS OR
IS IT JUST DOCUMENTS INDEXED INTO PRIMARY SHARDS? HOW IS THIS MEASURED?

The rate given is for the total number of documents: primary copies as well as replicas.

For details, please see THROUGHPUT MEASUREMENT SCRIPT, below.
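
As a back-of-the-envelope check on the scale (same python-one-liner style as the measurement script below, assuming the 2 replicas configured above and the initial rate of roughly 7K doc copies per second per node):

    # 7K copies/s/node * 120 nodes = 840K copies/s cluster-wide;
    # divided by 3 copies per document (1 primary + 2 replicas) = ~280K unique docs/s;
    # at ~2,300 bytes each that is ~614 MiB/s of raw source entering the cluster.
    python -c "print 7000 * 120, (7000 * 120) / 3, (7000 * 120) / 3 * 2300 / (1024 * 1024)"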

ARE YOU USING NESTED DOCUMENTS?

    yes

IF SO, HOW MANY NESTED ELEMENTS ARE THERE ON AVERAGE PER DOCUMENT?

    please see FULL EXAMPLE OF DOCUMENT, below.

WHY ARE YOU PADDING DATA UP TO A SPECIFIC LENGTH?

    To get the total number of bytes per document to 2,300, which is the
    average for the production cluster whose load the benchmark is trying to simulate.

FULL EXAMPLE OF DOCUMENT

      "object0" : {
        "text0" : "1YJj9HJ0goMOvrXQxkNEAX68aGP1lZ5KjnRbxerpG2Dga0tz9b60FeHztdrcmPh0gzRqVCLjLBkcNO3v4lT4iI5wMs5vPsP4EsqqHV4eeMMdlHgd5AVyWSaOXDNpiCyFrMsz1geu4xySKBT4B5NuA2HhlNFcBbnekPbuhAgWmLcCafhHKqByQWlclq8ob5bVUesKyVGbWOTGzAbi9CJLFBSSpMIXSLb67uoatQ",
        "long0" : 138,
        "keyword0" : "fdYGTeeYwEWn0CEt8aindoGHrNPYe9TWyqTMPMZIlS9zSyLSAey1Frb5mMmJclQuSwXsPYOHm7I1wTU0SOnKBUHLm0vVyFcyzyF4BTrHRARJ1PRF3o7gMPdVVM8NYAtMCo2g1zYp9c915EbPBNL93DgEeEJHZDFGhrVo4riu7eQrKM1VWiTjQSI5qKAL8hmW1kE8yzHH520B4i8XIm48XrmHFQLdDK1tznhrUL"
      },
      "keyword3" : "uJcQ3i6ugx3sM5dntK8oYp51NCyGFI0mRdK8qXQaRYltjSVUxue2XrhquoHeIAhbG0hpF9zQPx3SVQIrCbs55dLQD1UuFfRnhwxHQadh0T74c0qcBvdzJVWHmcIB2gzMLJFHpBvQ8j7HaovDWTn78cUfDSutpTHFF60KBTVLaNvOyR2mK8WpbitWkTVraWHL6ExDaPgKIZi4v0Ch1XxhLBV94PyZ274aHMOYPl",
      "keywords" : "WodASGAIhVgQBDCqibR5qd30jnvBXVg6icUAU3OJvhNdr6qd84XKCFDz4xNMghTilIHuZj6kyF9yNl9sjvz6mdlzTdjKMV4J5yxQejhlzkQEsyWxJtlbeDTgjSrbWcWYW91xlOVMELqxoz3opFo88Gs6IANWr3Ezh9S4zuAcwRtpfqZalvXq56mE8CHtZ0DG8i57r7Bq2cAciBLpdca1R1WsxZ9oqb5yBRVnhm",
      "keyword1" : "4pYUeclgzkUdy9KVh4m9nIEuCLHIDc5ldoayJhXsoL8tln1NkC68ifMDxAqatIeJu7XpqPzfn7suTmHquJTxYouTmkvrkbsqslz9HoyDt6Rc3oGyTwRWgxiky5VHLJ9kn4me4x1P8xuHkovW5XsbET9hmUW9sVnooTl1sAuzCpQmoRdVep1xspHoPtrcXIX5ZwrOefOUlHLuIeDBFEed8sUSjvMWL4yL5GWxCG",
      "keyword2" : "8YK5nKI5pL7DtxMaZdrb3XQBlKnVwXxaHiKdujq5oYLwvZImKEKssF5jLJbbte9Y6UCUV8uZcSXu4scnka8fZwL0g20I639SDmLUkM6qch7sbfb7oEcsPyd33G81XMALHAibVX9vJHN7tFp9HIl7s8a68QmOqw5L4aXgGK04qVyFnHmk0uyftSEocBLkT5rDwpdMtyGjTkDPmCtF1YJrKwVXTjsFgu1peRe7Qq",
      "nested0" : {
        "text0" : "ygxwDCokFJmhiJI4ZpXx8urIN8n1nahqeehvlMAcm4nuzLNp3WvBG4tQDk7RSMIJHM8t5DS0Oeb70FoqSj2w2Sh5LzNCGIMcviHjdQkPctNDETYHr2SuKEJkLvubj6aYZSkyQcDhSZY57nzkduHJHOtLCWDwvHgsJULvQFSJTJcdml2NzzwMKW8CA4eF600746H5tO46tRYG0GRpAbCnh0VDnmF4zMH9s2Yn0r",
        "long0" : 480,
        "keyword0" : "w3XVJYIO7P1wBZxIKZ5NMEqTSvnvUadEEGRxaoRDkSBOYRsJ9OwTU41rA8JQnQkAqQlA0mOG5xJ6ca88rGifkq6R3alLy3tWcxBFWaTRq3z9Kfrek6S377yV9IglEzohzEjV5ufS0rbteUZSmEWnkOrAYXqnZo8Sq0SR2dY9uK0W60OkaDqTgD22QHRtJkRvvRZv2p8STlM13MqnG1p9vI3ppaeaa3YyRZ1jKC"
      },
      "date2" : 1504118461705,
      "date1" : "Aug 30, 2017 6:41:01 PM",
      "long2" : 134,
      "long3" : 557,
      "long1" : 221,
      "isBulkWrite" : "true",
      "text1" : "2fC33x3oVXwE6XxxppzS43CtwymjAS6UlOEL1vpOAk8Vn2spzr4hz3GvCopgafMw2FotzzQn5nZRwQBb6wOM76GGyz45egHt1iDHRnLbmaAR1Tj6f6L3VpH05p0HZCuaHC8xnJ5Ftd03c5Jbw18zbp86fdy3otRGDOVLqBRM8l5Ysb0wxeQdtelbLabl2FY9JDmpGDaYCiSqBEFxxdHLlssz6OmxSdtaiLR1w4",
      "text2" : "pBoDvylOlngTT2huceIh2eKT3fo0EJ7BnRBwkMs4T3y6Nr9UnWrfhTjcFbU55CJJCseNkLXK5YmFilpvJ6sCrKnJxSTjNsh8otAtaWMHhWoTxmAqA0LeGjbebWlo7ScuoTHSjbIiJWkXQiVovBZ974l71thK1gFeODLTzEhBcSK5t6yvFvHNXgsRxffOqjPloINmXLrYUSrflOICsg8GJt7OAVS0zJ6xtgfzbK",
      "long4" : 555
    }

THROUGHPUT MEASUREMENT SCRIPT

SCRIPT (calc_rate.sh)

    # Usage: calc_rate.sh <stats-url> <num-replicas> <num-data-nodes> <sample-period-seconds>
    countUrl=$1
    numReplicas=$2
    numDataNodes=$3
    samplePeriodSeconds=$4

    # Primary-shard document count at the start and end of the sample period
    beginCount=`curl -s "$countUrl" | jq '.indices.ndbench_index.primaries.indexing.index_total'`
    sleep $samplePeriodSeconds
    endCount=`curl -s "$countUrl" | jq '.indices.ndbench_index.primaries.indexing.index_total'`

    perSecondIncrease=`python -c "print ($endCount - $beginCount) / ($samplePeriodSeconds*1.0)"`
    echo document count growth rate per second is $perSecondIncrease

    #   multiply the increase in number of documents by R, which is:
    #      replica count + 1 (where the 'plus 1' accounts for the original doc)
    #
    R=`python -c "print ($numReplicas + 1)"`

    totDocsGrowthPerSec=`python -c "print ($R * $perSecondIncrease)"`

    echo document count growth rate per second including replicas is $totDocsGrowthPerSec
    indexRatePerNode=`python -c "print ($totDocsGrowthPerSec*1.0 / $numDataNodes)"`
    echo indexRatePerNode is $indexRatePerNode

EXAMPLE OF HOW SCRIPT IS CALLED

    dataNodes=120
    samplePeriod=120
    bash ~/dev/scripts/calc_rate.sh "http://host:<port>/_stats?pretty" 2 $dataNodes $samplePeriod

By the way.. to give you more context, we found this current
performance issue after we got past this
issue that you guys very kindly helped us with:
https://github.com/elastic/elasticsearch/issues/26293

You can use the quote and code formatting buttons on your posts; it makes things a bit easier than rewriting the quoted questions in caps :slight_smile:

ah ok. will do !


I have a few comments:

  1. I always recommend benchmarking with data that is as realistic as possible. The example data you provided, however, looks extremely artificial, and I am not sure how well the results of this benchmark will correlate with real-life performance. As far as I can see, all keyword fields appear to have unique/high-cardinality values, and the text fields do not contain any realistic text at all. Padding small documents in order to reach a specific size to send over the wire will also not necessarily give representative results.

  2. The fact that you see indexing performance slow down with increased data volumes could be due to increased merging pressure and/or the fact that you are using externally generated IDs, especially if you have selected an inefficient identifier. This is further explained here.

  3. Is the 60TB of data spread across multiple indices or is it just the one currently being written to? If it is just the current one it looks like you may be having very large shards, which could affect performance. At this scale you may want to look into using the rollover API. Do you get the same drop in indexing speed if you let Elasticsearch assign the document ids?

  4. Primary and replica shards basically do the same work for indexing, so by having 2 replicas instead of 1 the cluster does 50% more work. Having 2 replicas does however improve resiliency.

  5. Given the size of your cluster, most of the data received by a coordinating node will need to be forwarded to other nodes. The fact that you have 2 replicas configured will also increase the amount of network traffic. As you are indexing into a large number of shards, each bulk request of 2000 documents will be broken up into small batches, one per primary shard. It is therefore possible that network traffic could be limiting your performance. It might be interesting to see how performance varies if you were to partition the main index into several smaller ones (e.g. 9 indices of 40 primary shards and 2 replicas) and allocate each benchmarking client to write to just one of these indices (see the sketch after this list).

  6. Although your documents have a nested structure, it does not seem like they would benefit from nested mappings (which would generate multiple documents behind the scenes). Do you have any nested data specified in your mappings?

  7. Have you gone through the standard indexing-speed tuning recommendations in the Elasticsearch documentation?
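
To illustrate point 5, creating the smaller indices could look something like this (a sketch only; the host and the index names are illustrative):

    # Create 9 indices of 40 primary shards / 2 replicas instead of one 360-shard index,
    # then point each group of benchmarking clients at a different index.
    for i in $(seq 1 9); do
      curl -s -H 'Content-Type: application/json' \
           -XPUT "http://localhost:9200/ndbench_index_$i" -d '
      { "settings": { "number_of_shards": 40, "number_of_replicas": 2 } }'
    done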

Christian & Mark -
thanks very much for your advice and comments !

Responding (partially) to Christian's latest comments

#1)
I will act on your advice regarding how high-cardinality fields could skew the benchmark results away
from what would be observed in the actual production system. Makes perfect sense. So
I will modify my Elasticsearch REST adapter for ndbench to generate more realistic data (real words, not random gibberish).

#2)
I looked into that and got similar results even when I let Elasticsearch generate the IDs instead of supplying
them as part of the bulk indexing request.

We are seeing a dramatic rise in disk reads as data grows. I am guessing a key factor is segment merges, and I am going
to look at tuning the segment merge phase (if you have a recommendation as to which merge strategy is best
for a use case that generates 60TB per index per day, please let us know!)

(graph of disk reads/writes is attached)
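
In the meantime, this is how I am watching merge activity to confirm that hypothesis (a sketch; host and port are placeholders, and the index name is the one used in the measurement script):

    # Cumulative merge counts, merged bytes, and time spent merging for the benchmark index
    curl -s 'http://localhost:9200/ndbench_index/_stats/merges?pretty'
    # Per-shard segment counts and sizes
    curl -s 'http://localhost:9200/_cat/segments/ndbench_index?v'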

#3)
Is the 60TB of data spread across multiple indices or is it just the one currently being written to?

    just one

looks like you may be having very large shards,

yes... we tried 360 shards and 120. 120 gives about 4% better indexing throughput, and the resultant shard size after 60TB of primary
index data is 500GB.

At this scale you may want to look into using the rollover API.

yes. We will definitely look at breaking up our monster index, maybe into 24 parts by hour. We need to make sure
we can still use Kibana against it and can get that to work as it does now (see the rollover sketch below).

Do you get the same drop in indexing speed if you let Elasticsearch assign the document ids?

Surprisingly, this made no difference for us.
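
Here is roughly how I am thinking of setting up the hourly rollover mentioned above (a sketch only; the host, the alias name, and the conditions are illustrative):

    # Bootstrap the first index behind a write alias
    curl -s -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/ndbench_index-000001' -d '
    { "aliases": { "ndbench_write": {} } }'

    # Run periodically (e.g. from cron): once the current index is an hour old,
    # a new ndbench_index-00000N is created and the write alias moves to it.
    curl -s -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/ndbench_write/_rollover' -d '
    { "conditions": { "max_age": "1h" } }'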

#4)
Our production system actually uses 1 replica, not 2. I dropped replicas to 1 in the benchmark, but it did not
seem to make much difference in the total number of documents indexed per second per node. So, yes indeed, our
experience is exactly as you say: "Primary and replica shards basically do the same work for indexing".

will reply to other points tomorrow.

thnx
-chris

[attached image: graph of disk reads/writes]

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.