A lot of 'Courier Fetch: 42 of 155 shards failed.' - EsRejectedExecutionException

I have been getting a lot of these warnings lately in Kibana (4.3.1), just watching a dashboard with 13 visualizations on it over a period of 30 days (no errors or warnings are produced when searching over the last 7 days, though). Digging further, I found a better error in the JSON returned by the queries: some of the responses contained that exception.

In the ElasticSearch logs, the exception is a lot more detailed, but it still doesn't give me any clue as to why this is happening. Here is part of the exception:

[2015-12-29 16:41:09,731][DEBUG][action.search.type       ] [Cypher] [logstash-2015.12.08][0], node[i2hQZPwoT3qpK8Zvupeg7A], [P], v[12], s[STARTED], a[id=vntyCQvDTle247Turf8Olg]: Failed to execute [org.elasticsearch.action.search.SearchRequest@5749b0ae] lastShard [true]
RemoteTransportException[[Cypher][127.0.0.1:9300][indices:data/read/search[phase/query]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@1bf3a021 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@e3fbd3f[Running, pool size = 10, active threads = 10, queued tasks = 1000, completed tasks = 1672838]]];
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@1bf3a021 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@e3fbd3f[Running, pool size = 10, active threads = 10, queued tasks = 1000, completed tasks = 1672838]]]

Here is the full exception: http://pastie.org/private/crlbpktacu4k2lfwgexvbq (it's too big for the post).

I got some help on IRC: I installed Marvel to monitor ElasticSearch (2.1.1) and watched it while reproducing the issue. Nothing looks unusual; everything seems pretty normal (plenty of memory left, low memory usage, low CPU, etc.).
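For anyone who wants to check the same thing on their own node, the search queue and rejection counters are also visible through the _cat API (this assumes Elasticsearch listens on localhost:9200; adjust for your setup):

curl -s 'localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected'

If the search.rejected column climbs while the dashboard loads, it matches the rejections shown in the exception above, even though CPU and memory look fine.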

One suggestion was to increase the size of the search thread pool queue, but I was told that is only a workaround and I will end up hitting the issue again at some point.
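For reference, the workaround as it was explained to me would be a setting along these lines in elasticsearch.yml on 2.x (2000 is just an example value; the default queue size is 1000, as shown in the exception above), followed by a node restart:

threadpool.search.queue_size: 2000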

Any pointers or tips on figuring out the origin of the issue?

What does your cluster look like (number of nodes, type of hardware, amount of Java heap, number of indices/shards, amount of data in the cluster etc)?

1 node. It is a VM (there is no other VM on that server) on a host with an Intel(R) Xeon(R) CPU L5520 @ 2.27GHz, 32 GB of RAM, and a 480 GB SSD in RAID 1.
The VM consists of 6 cores of that Xeon L5520, 24 GB of RAM, and a 400 GB disk,
and 12 GB of heap is given to ElasticSearch.

There is about 15.4 GB of data in ES, spread across 40 indices and 356 shards (half of them are unassigned, the other half are 'started').
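(For anyone who wants to look at the same thing on their cluster, something like this lists the per-shard states; again this assumes Elasticsearch on localhost:9200:)

curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state'

If the unassigned half are all replicas, that part at least is expected on a single node, since a replica is never allocated on the same node as its primary.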

Anything else that could help?

Any parameter I could tune to help prevent that issue from happening?

Any idea?

Hi @thomas.dotreppe, did you find a solution for this? I have the same issue!

Nope, and nobody seems to have a clue what's wrong, even with all these details.