Summary
- Before the problem, the cluster could ingest 500+ records per second.
- When the problem happens, performance drops suddenly and substantially (e.g. 76 index operations taking 20+ seconds).
- The cluster recovers after a while, without our intervention.
- We see errors in the logs related to ES rejecting operations (the bulk thread pool rejecting tasks after reaching 50 queued elements).
- The problem has happened a few times now, with increasing frequency, without any changes to the code that ingests the documents and without spikes in ingestion volume.
- All metrics look healthy (CPU, JVM pressure, Disk IOPS, etc), with no substantial changes during the outage.
Environment
- Using an AWS-managed ES cluster v2.3, with two m2.large instances (no dedicated master node).
- We later scaled the cluster up to 4 nodes, each one an m4.xlarge instance. The problem still happens, with the same or worse intensity.
Ingestion code
- Using Java Jest library (https://github.com/searchbox-io/Jest).
- We pull a batch of records from a stream. Most of the time the batch size is just one or two records; sometimes we get a burst of events that generates ~100 records per batch.
- Each batch translates into four bulk requests, run concurrently, each one hitting a single, different index (see the sketch after this list). No bulk request has ever exceeded 10MB in size.
- Two indices are append-only (i.e. 100% of the inner requests append documents, letting the cluster generate the _id field values).
- The other two bulk requests are a mix of upsert and delete operations (where the _id value is managed by the application).
- The rejection affects all writes (i.e. we see this error in all the indices).
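To make the shape of these requests concrete, here is a minimal sketch of how such a batch could be assembled and dispatched with Jest. This is not our production code: the class name, the index/type names, the upsert payload shape and the CompletableFuture fan-out are simplified placeholders, but Bulk/Index/Update/Delete are the standard Jest actions for these operations.

import io.searchbox.client.JestClient;
import io.searchbox.core.Bulk;
import io.searchbox.core.BulkResult;
import io.searchbox.core.Delete;
import io.searchbox.core.Index;
import io.searchbox.core.Update;

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Simplified sketch: one pulled batch fans out into four concurrent bulk requests,
// each one targeting a single index (two append-only, two upsert/delete).
public class BatchIngester {

    private final JestClient client;

    public BatchIngester(JestClient client) {
        this.client = client;
    }

    // Append-only bulk: no _id is set, so the cluster generates one per document.
    Bulk appendOnlyBulk(String index, List<String> jsonDocs) {
        Bulk.Builder bulk = new Bulk.Builder().defaultIndex(index).defaultType("event");
        for (String json : jsonDocs) {
            bulk.addAction(new Index.Builder(json).build());
        }
        return bulk.build();
    }

    // Mixed bulk: upserts and deletes keyed by an application-managed _id.
    Bulk mixedBulk(String index, Map<String, String> upsertBodiesById, List<String> deleteIds) {
        Bulk.Builder bulk = new Bulk.Builder().defaultIndex(index).defaultType("event");
        // Assumed upsert body shape, e.g. {"doc": {...}, "doc_as_upsert": true}.
        upsertBodiesById.forEach((id, body) ->
                bulk.addAction(new Update.Builder(body).id(id).build()));
        deleteIds.forEach(id -> bulk.addAction(new Delete.Builder(id).build()));
        return bulk.build();
    }

    // The four bulk requests run concurrently; we wait for all of them
    // before pulling the next batch from the stream.
    void ingest(Bulk b1, Bulk b2, Bulk b3, Bulk b4) {
        CompletableFuture.allOf(
                CompletableFuture.runAsync(() -> execute(b1)),
                CompletableFuture.runAsync(() -> execute(b2)),
                CompletableFuture.runAsync(() -> execute(b3)),
                CompletableFuture.runAsync(() -> execute(b4))
        ).join();
    }

    private void execute(Bulk bulk) {
        try {
            BulkResult result = client.execute(bulk);
            if (!result.isSucceeded()) {
                System.err.println("Bulk failed: " + result.getErrorMessage());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}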
Correlation
- We cannot correlate this to any code change (in fact, there was none) or burst of traffic.
- On one occasion we saw this outage shortly after a burst of _search requests (all of them hitting the same index). The outage happened again an hour after that burst.
Some debugging...
During the outage, we can confirm that the queues for bulk processing were overloaded. In the output below, ba/bs/bq/bqs/br are the bulk pool's active threads, pool size, queued tasks, queue capacity and rejections; the i* and s* columns are the same stats for the index and search pools:
GET _cat/thread_pool?v=1&h=id,ba,bs,bq,bqs,br,bt,ia,is,iq,iqs,ir,it,sa,ss,sq,sqs,sr,st HTTP/1.1
id ba bs bq bqs br bt ia is iq iqs ir it sa ss sq sqs sr st
4BIT 4 4 5 50 0 fixed 0 0 0 200 0 fixed 0 7 0 1000 0 fixed
0YTF 4 4 44 50 103 fixed 0 0 0 200 0 fixed 0 7 0 1000 0 fixed
mtXJ 4 4 28 50 77 fixed 0 0 0 200 0 fixed 0 7 0 1000 0 fixed
JJfa 4 4 22 50 74 fixed 0 0 0 200 0 fixed 0 7 0 1000 0 fixed
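For reference, the rejections are also visible on the client side: Jest's BulkResult exposes per-item failures. The sketch below (not our exact error handling; the 429 / EsRejectedExecutionException detail in the comment is my understanding of how a thread-pool rejection surfaces, please correct me if that's wrong) shows one way to dump them:

import io.searchbox.client.JestClient;
import io.searchbox.core.Bulk;
import io.searchbox.core.BulkResult;

import java.io.IOException;

// Illustrative helper: a bulk thread-pool rejection usually comes back as a
// per-item failure (status 429 with an EsRejectedExecutionException-style error)
// while the bulk request as a whole can still return HTTP 200.
public class BulkRejectionLogger {

    public static BulkResult executeAndLog(JestClient client, Bulk bulk) throws IOException {
        BulkResult result = client.execute(bulk);
        for (BulkResult.BulkResultItem item : result.getFailedItems()) {
            System.err.printf("op=%s index=%s id=%s status=%d error=%s%n",
                    item.operation, item.index, item.id, item.status, item.error);
        }
        return result;
    }
}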
We also see lots of pending put-mapping tasks, which we assume are related to indexing (see the block below). We don't know how bulk processing relates to indexing.
I would imagine the following happens while processing bulk requests (please feel free to confirm or rectify this guess...):
- Each bulk request consumes one active thread from the bulk thread pool on the node that received the request.
- Processing such a bulk task involves forwarding each inner operation to the proper node in the cluster, then blocking until all of them have been processed.
- Results are collected, merged, returned, and the thread is returned to the pool.
- Other nodes use threads from their own pools while processing the inner requests. E.g. if the bulk request contained only index operations, it's fair to expect other nodes in the cluster to start consuming threads from their thread pools dedicated to index operations.
However... when we check the status of the thread pools on the nodes, we see that only the bulk thread pools are at full capacity (with lots of tasks queued, and some rejections once the queue exceeds its size of 50). We don't see activity on the other thread pools. We know that we dispatch at most four concurrent bulk requests to the cluster, and no more until all four are completed.
GET _cat/pending_tasks?v=1 HTTP/1.1
insertOrder timeInQueue priority source
9432 26.1s HIGH put-mapping [cr]
9433 26.1s HIGH put-mapping [cr]
9434 26.1s HIGH put-mapping [cr]
9435 26.1s HIGH put-mapping [cr]
9438 26.1s HIGH put-mapping [cr]
9439 26.1s HIGH put-mapping [cr]
9436 26.1s HIGH put-mapping [cr]
9437 26.1s HIGH put-mapping [cr]
9441 9.8s HIGH put-mapping [cr]
9440 9.8s HIGH put-mapping [cr]
9442 9.8s HIGH put-mapping [cr]
9443 9.8s HIGH put-mapping [cr]
9444 9.8s HIGH put-mapping [cr]
9445 9.8s HIGH put-mapping [cr]
9446 9.8s HIGH put-mapping [cr]
9447 9.8s HIGH put-mapping [cr]
And this is the state of each node:
GET _cat/nodes?h=id,v,j,d,hc,hp,ru,rp,fdp,l&v=1 HTTP/1.1
id v j d hc hp rp fdp l
JJfa 2.3.2 1.8.0_102 15.5gb 4.3gb 54 90 0 0.20
0YTF 2.3.2 1.8.0_102 16gb 3.1gb 39 87 0 1.17
mtXJ 2.3.2 1.8.0_102 15.3gb 3.4gb 42 91 0 0.53
4BIT 2.3.2 1.8.0_102 15.9gb 4.8gb 60 82 0 0.85
Some questions
- Why is it that we see lots of pending put-mapping operations queued, yet at the same time very low activity in the thread pools for index operations?...
- What other metrics could we check to understand why the cluster suddenly becomes painfully slow?
- Similar to the question above: what other information could help us identify the bottleneck?
Any help or insights will be highly appreciated...
Thanks,