Production Cluster Suddenly Crashed Last Night

Hi Guys,

We've been running a production cluster of Elasticsearch 1.0.0 with 3 nodes in 3 regions in AWS for about a year now, averaging about 5K requests per minute.
Last night, although there was no traffic spike, the Elasticsearch log started to fill up with the following exception:

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000)

Shortly after that the CPU of the 3 machines went up to 100% and they became inaccessible.

We restarted them and all is well for now, but we are trying to understand what happened and would greatly appreciate any insight we can get from the members of this forum.
Just to reiterate, there was no traffic spike.
(We are also waiting to hear whether there was a network error on Amazon's side, but that wouldn't fully explain the CPU spike.)


Do you know if you got this exception as part of a search or indexing operation (the full stack trace could help figure it out)?

It was a search operation.
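
Since the rejections came from search, the search thread pool counters in the node stats are one place to look for which nodes were queueing or rejecting requests. Below is a minimal sketch, assuming the response shape of `GET /_nodes/stats/thread_pool` (a `nodes` map whose entries carry `thread_pool.search.queue` and `thread_pool.search.rejected`). The host, port, and the helper names `fetch_node_stats` and `search_pool_summary` are made up for illustration, and the sample data is hand-written, not output from this cluster.

```python
import json
from urllib.request import urlopen

# Assumed endpoint; adjust host and port for your own cluster.
STATS_URL = "http://localhost:9200/_nodes/stats/thread_pool"


def fetch_node_stats(url=STATS_URL):
    """Fetch node stats as a dict (needs a reachable cluster; not called below)."""
    with urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def search_pool_summary(stats):
    """Return {node_name: (queue, rejected)} for the search thread pool."""
    summary = {}
    for node_id, node in stats.get("nodes", {}).items():
        pool = node.get("thread_pool", {}).get("search", {})
        # Fall back to the node id if the node has no name field.
        name = node.get("name", node_id)
        summary[name] = (pool.get("queue", 0), pool.get("rejected", 0))
    return summary


# Hand-written sample in the assumed response shape (hypothetical values).
sample = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "thread_pool": {"search": {"queue": 1000, "rejected": 42}},
        },
        "def456": {
            "name": "node-2",
            "thread_pool": {"search": {"queue": 0, "rejected": 0}},
        },
    }
}

print(search_pool_summary(sample))
```

Against a live cluster you would pass `fetch_node_stats()` to `search_pool_summary` instead of the sample. A `rejected` count that keeps growing on one node while the others stay flat would suggest the problem is local to that node rather than cluster-wide.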

How much data do you have in your cluster, and how many indices and shards?

The crash was almost certainly caused by the AWS downtime:
Thanks for the help anyway :smile: