Cluster suddenly starts rejecting bulk requests due to thread pool queue exhaustion

Summary

  • Before the problem, the cluster could ingest 500+ records per second.
  • When the problem happens, performance drops substantially and suddenly (e.g. 76 index operations taking 20+ seconds).
  • The cluster recovers after a while, without our intervention.
  • We see errors in the logs related to ES rejecting operations (bulk threadpool rejecting task due to reaching 50 queued elements).
  • The problem has happened a couple of times (with increasing frequency), without changes in the code ingesting the documents or spikes in ingestion volume.
  • All metrics look healthy (CPU, JVM pressure, Disk IOPS, etc), with no substantial changes during the outage.

Environment

  • Using an AWS managed ES cluster v2.3, with two m2.large instances (no dedicated master node).
  • The cluster was later scaled up to 4 nodes, each one an m4.xlarge instance. The problem still happens, with the same or worse intensity.

Ingestion code

  • Using Java Jest library (https://github.com/searchbox-io/Jest).
  • We pull a batch of records from a stream. Most of the time the batch size is just one or two records. Sometimes we have a burst of events, generating ~100 records per batch.
  • Each batch translates into four bulk requests, run concurrently, each one hitting a single, different index (see the sketch after this list). No bulk request has ever exceeded 10MB in size.
  • Two indices are append-only (i.e. 100% of the inner requests append documents, letting the cluster generate the _id field values).
  • The other two bulk requests are a mix of upsert and delete operations (where the _id value is managed by the application).
  • The rejections affect all writes (i.e. we see this error on all the indices).
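
To make the fan-out concrete, here is a simplified sketch of what the ingestion code does (index names, the "doc" type and the payload plumbing are placeholders, not our actual code):

// Simplified sketch only: hypothetical index names ("index-a" ... "index-d"),
// a made-up "doc" type, and pre-serialized JSON payloads stand in for the real code.
import io.searchbox.client.JestClient;
import io.searchbox.client.JestResult;
import io.searchbox.core.Bulk;
import io.searchbox.core.Delete;
import io.searchbox.core.Index;
import io.searchbox.core.Update;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchWriter {

    private final JestClient client;
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    BatchWriter(JestClient client) {
        this.client = client;
    }

    // Append-only index: no _id is set, so the cluster generates the ids.
    private Callable<JestResult> appendOnlyBulk(String index, List<String> jsonDocs) {
        Bulk.Builder bulk = new Bulk.Builder().defaultIndex(index).defaultType("doc");
        for (String doc : jsonDocs) {
            bulk.addAction(new Index.Builder(doc).build());
        }
        return () -> client.execute(bulk.build());
    }

    // Mixed index: upserts and deletes, with _id managed by the application.
    private Callable<JestResult> upsertDeleteBulk(String index,
                                                  Map<String, String> upsertPayloadsById,
                                                  List<String> deleteIds) {
        Bulk.Builder bulk = new Bulk.Builder().defaultIndex(index).defaultType("doc");
        upsertPayloadsById.forEach((id, payload) ->
                bulk.addAction(new Update.Builder(payload).id(id).build()));
        deleteIds.forEach(id -> bulk.addAction(new Delete.Builder(id).build()));
        return () -> client.execute(bulk.build());
    }

    // Each batch becomes four concurrent bulk requests, one per index; we wait
    // for all four before pulling the next batch from the stream.
    void writeBatch(List<String> docsA, List<String> docsB,
                    Map<String, String> upsertsC, List<String> deletesC,
                    Map<String, String> upsertsD, List<String> deletesD) throws Exception {
        List<Future<JestResult>> results = pool.invokeAll(Arrays.asList(
                appendOnlyBulk("index-a", docsA),
                appendOnlyBulk("index-b", docsB),
                upsertDeleteBulk("index-c", upsertsC, deletesC),
                upsertDeleteBulk("index-d", upsertsD, deletesD)));
        for (Future<JestResult> result : results) {
            result.get(); // propagate failures; rejections surface as bulk item errors
        }
    }
}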

Correlation

  • We cannot correlate this to any code change (in fact, there was none) or burst of traffic.
  • On one occasion we saw this outage shortly after a burst of _search requests (all of them hitting the same index). The outage happened again an hour after that burst.

Some debugging...

During the outage, we can confirm that the queues for bulk processing were overloaded (in the output below, ba/bs/bq/bqs/br/bt are the bulk pool's active threads, pool size, queued tasks, queue capacity, rejections, and pool type; the i* and s* columns are the same fields for the index and search pools):

GET _cat/thread_pool?v=1&h=id,ba,bs,bq,bqs,br,bt,ia,is,iq,iqs,ir,it,sa,ss,sq,sqs,sr,st HTTP/1.1
 
id   ba bs bq bqs  br bt    ia is iq iqs ir it    sa ss sq  sqs sr st
4BIT  4  4  5  50   0 fixed  0  0  0 200  0 fixed  0  7  0 1000  0 fixed
0YTF  4  4 44  50 103 fixed  0  0  0 200  0 fixed  0  7  0 1000  0 fixed
mtXJ  4  4 28  50  77 fixed  0  0  0 200  0 fixed  0  7  0 1000  0 fixed
JJfa  4  4 22  50  74 fixed  0  0  0 200  0 fixed  0  7  0 1000  0 fixed

We also see lots of pending put-mapping tasks, which we assume are related to indexing (see the block further below). We don't know how bulk processing relates to indexing.

I would imagine the following happens while processing bulk requests (please feel free to confirm or rectify this guess...):

  • Each bulk request consumes one active thread from the bulk thread pool on the node that received the request.
  • Processing that bulk task involves forwarding each inner request to the proper node in the cluster, then blocking until all of them have been processed.
  • Results are collected, merged, and returned, and the thread is returned to the pool.
  • Other nodes use threads from their own pools while processing the inner requests. E.g. if the bulk request consisted entirely of index operations, it's fair to expect the other nodes in the cluster to start consuming threads from their thread pool dedicated to index operations.

However... when we check the status of the thread pools on the nodes, we see only the bulk thread pools at full capacity (with lots of tasks queued, and some rejections once the queue exceeds size=50). We don't see activity on the other thread pools. We know that we dispatch at most four concurrent bulk requests to the cluster, and no more until all four have completed.

GET _cat/pending_tasks?v=1 HTTP/1.1
 
insertOrder timeInQueue priority source
       9432       26.1s HIGH     put-mapping [cr]
       9433       26.1s HIGH     put-mapping [cr]
       9434       26.1s HIGH     put-mapping [cr]
       9435       26.1s HIGH     put-mapping [cr]
       9438       26.1s HIGH     put-mapping [cr]
       9439       26.1s HIGH     put-mapping [cr]
       9436       26.1s HIGH     put-mapping [cr]
       9437       26.1s HIGH     put-mapping [cr]
       9441        9.8s HIGH     put-mapping [cr]
       9440        9.8s HIGH     put-mapping [cr]
       9442        9.8s HIGH     put-mapping [cr]
       9443        9.8s HIGH     put-mapping [cr]
       9444        9.8s HIGH     put-mapping [cr]
       9445        9.8s HIGH     put-mapping [cr]
       9446        9.8s HIGH     put-mapping [cr]
       9447        9.8s HIGH     put-mapping [cr]

And this is the state of each node (v = ES version, j = JDK, d = free disk, hc/hp = heap used and heap %, rp = RAM %, fdp = file descriptor %, l = load average):

GET _cat/nodes?h=id,v,j,d,hc,hp,ru,rp,fdp,l&v=1 HTTP/1.1

id   v     j              d    hc hp rp fdp    l
JJfa 2.3.2 1.8.0_102 15.5gb 4.3gb 54 90   0 0.20
0YTF 2.3.2 1.8.0_102   16gb 3.1gb 39 87   0 1.17
mtXJ 2.3.2 1.8.0_102 15.3gb 3.4gb 42 91   0 0.53
4BIT 2.3.2 1.8.0_102 15.9gb 4.8gb 60 82   0 0.85

Some questions

  1. Why is it that we see lots of pending put-mapping operations queued, and at the same time very low activity in the thread pools for index operations?...
  2. What other metrics could we check to understand why the cluster suddenly becomes painfully slow?
  3. Similar to the question above: what other information could help us identify the bottleneck?

Any help or insights will be highly appreciated...

Thanks,

Hey,

is your mapping growing? The pending tasks seem to be adding new fields to the mapping. You should make sure you don't run into a mapping explosion (an ever-growing number of fields in an index). You could also try to keep your mappings small and configure them upfront, so there are no mapping updates needed while indexing.
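
For example (a rough, untested sketch with made-up index, type and field names, using the same Jest client), you could put an explicit mapping in place once and disable dynamic field additions:

// Rough sketch: define the mapping upfront and set "dynamic" to "strict" so
// unexpected fields fail fast instead of silently growing the mapping
// (use "false" instead if you prefer to ignore unknown fields).
// Index, type and field names are made up; adjust them to your data model.
import io.searchbox.client.JestClient;
import io.searchbox.indices.mapping.PutMapping;

class UpfrontMapping {

    static void applyStrictMapping(JestClient client) throws Exception {
        String mapping =
                "{ \"doc\": {"
              + "    \"dynamic\": \"strict\","
              + "    \"properties\": {"
              + "      \"created_at\": { \"type\": \"date\" },"
              + "      \"message\":    { \"type\": \"string\", \"index\": \"not_analyzed\" }"
              + "    }"
              + "} }";
        client.execute(new PutMapping.Builder("index-a", "doc", mapping).build());
    }
}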

It seems (you could measure index times) that bulk indexing is blocked and thus requests are piling up, but that is just an assumption from here.
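
If you want a quick, rough measurement from the client side, you could time your existing bulk calls and log the slow or failed ones, e.g. (untested sketch; the 5s threshold is arbitrary):

// Rough client-side timing around the existing Jest bulk calls. This only
// measures end-to-end latency as seen by the client, but it tells you when
// the requests start piling up.
import io.searchbox.client.JestClient;
import io.searchbox.client.JestResult;
import io.searchbox.core.Bulk;

class TimedBulk {

    static JestResult executeTimed(JestClient client, Bulk bulk) throws Exception {
        long start = System.nanoTime();
        JestResult result = client.execute(bulk);
        long tookMs = (System.nanoTime() - start) / 1_000_000;
        if (tookMs > 5_000 || !result.isSucceeded()) {
            System.err.printf("slow or failed bulk: %d ms, error=%s%n",
                    tookMs, result.getErrorMessage());
        }
        return result;
    }
}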

--Alex

Thanks Alex. Indeed, we've identified a case where a document insert creates variations in the data, which in turn adds new mappings. The current mapping output (GET /{index}/_mapping/{type}) is ~4MB. Do you have any advice on how much is too much?...
