Production Cluster Suddenly Crashed Last Night


(Yosi Haran) #1

Hi Guys,

We've been running a production Elasticsearch 1.0.0 cluster with 3 nodes in 3 regions on AWS for about a year now, averaging about 5K requests per minute.
Last night, although there was no traffic spike, the Elasticsearch log started to fill up with the following exception:

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000)

Shortly after that, CPU usage on all 3 machines went up to 100% and they became inaccessible.

We restarted them and all is well for now, but we are trying to understand what happened and would greatly appreciate any insight from the members of this forum.
Just to reiterate, there was no traffic spike.
(We are also waiting to hear whether there was a network error on Amazon's side, but that wouldn't fully explain the CPU spike.)

Thanks!


(Adrien Grand) #2

Do you know if you got this exception as part of a search or indexing operation (the full stack trace could help figure it out)?
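
If the stack trace is not at hand, the per-pool `rejected` counters in the node stats can also hint at which thread pool was turning work away. The following is only a minimal sketch: it assumes the response of `GET /_nodes/stats/thread_pool` has already been parsed into a dict, the function name `rejected_by_pool` is hypothetical, and the sample data is made up for illustration rather than taken from this cluster.

```python
def rejected_by_pool(stats):
    """Return {node_name: {pool_name: rejected_count}} for pools with rejections.

    `stats` is assumed to be the parsed JSON body of a node stats response
    restricted to thread pool metrics.
    """
    result = {}
    for node_id, node in stats.get("nodes", {}).items():
        # Fall back to the node id if the node has no name field.
        name = node.get("name", node_id)
        pools = node.get("thread_pool", {})
        hits = {
            pool: counters.get("rejected", 0)
            for pool, counters in pools.items()
            if counters.get("rejected", 0) > 0
        }
        if hits:
            result[name] = hits
    return result


# Illustrative sample only; these numbers are invented.
sample = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "thread_pool": {
                "search": {"threads": 12, "queue": 1000, "rejected": 4821},
                "index": {"threads": 4, "queue": 0, "rejected": 0},
            },
        }
    }
}

print(rejected_by_pool(sample))
```

A non-zero count under `search` and zero under `index` would point to search requests filling the queue, which is what the queue capacity of 1000 in the exception message suggests.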


(Yosi Haran) #3

It was a search operation.


(Mark Walkom) #4

How much data do you have, and how many indices and shards are in your cluster?


(Yosi Haran) #5

The crash was almost certainly caused by the AWS downtime: http://mashable.com/2015/06/30/aws-disruption/...
Thanks for the help anyway :smile:

