Production Cluster Suddenly Crashed Last Night

Yosi_Haran · July 1, 2015, 8:18am

Hi Guys,

We've been running a production Elasticsearch 1.0.0 cluster with 3 nodes in 3 AWS regions for about a year now, averaging roughly 5K requests per minute.
Last night, although there was no traffic spike, the Elasticsearch log started to fill up with the following exception:

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000)
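
For readers unfamiliar with this message: it generally means a thread pool with a bounded work queue had no free worker and no free queue slot when a new task arrived. The sketch below reproduces that behavior with plain `java.util.concurrent` classes. It is only an illustration, not Elasticsearch's own executor code; the class name `BoundedQueueDemo`, the method `countRejections`, and the single-worker setup are assumptions made for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedQueueDemo {

    // Hypothetical helper: submits `submissions` tasks to a one-thread pool
    // backed by a queue of `queueCapacity` slots and counts the rejections.
    static int countRejections(int queueCapacity, int submissions) {
        // The latch keeps the single worker busy so queued tasks cannot drain.
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
        int rejected = 0;
        try {
            for (int i = 0; i < submissions; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // simulate a slow, stuck request
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // Worker busy and queue full: the default policy aborts.
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
            try {
                pool.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 1 running task + 1000 queued tasks fit; the rest are rejected.
        System.out.println("rejected = " + countRejections(1000, 1010));
    }
}
```

The point of the sketch is that rejections can start without any increase in incoming traffic: if the tasks already running stop completing (for example, because they are waiting on something slow), the queue fills at the normal request rate.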

Shortly after that, CPU usage on all 3 machines went up to 100% and they became inaccessible.

We restarted them and all is well for now, but we are trying to understand what happened and would greatly appreciate any insight from the members of this forum.
Just to reiterate, there was no traffic spike.
(We are also waiting to hear whether there was a network error on Amazon's side, but that wouldn't fully explain the CPU spike.)

Thanks!

jpountz · July 1, 2015, 11:29am

Do you know if you got this exception as part of a search or indexing operation (the full stack trace could help figure it out)?

Yosi_Haran · July 1, 2015, 11:41am

It was a search operation.

warkolm · July 2, 2015, 3:12am

How much data, and how many indices and shards, do you have in your cluster?

Yosi_Haran · July 2, 2015, 3:05pm

The crash was almost certainly caused by the AWS downtime: http://mashable.com/2015/06/30/aws-disruption/...
Thanks for the help anyway.
