Kibana - Courier Fetch X of Y shards failed

Apologies for the topic. I know this is a common fault; I'm just trying to get my head around our own circumstances, which I suspect come down to a misunderstanding somewhere.

About three months ago we switched our indexing strategy from writing continuously into one index and periodically removing old documents (on Elasticsearch 1.x) to writing to daily indexes (on Elasticsearch 2.4). Most recently we set an _all template to reduce the number of shards per index from five to two.

We're seeing Kibana report shard failures more and more often on a dashboard with eight graphs showing operational breakdowns.

Most of these graphs draw data from the same daily index and I'm querying for less than one day's data, so seeing a message along the lines of "36 of 350 shards failed" puzzles me - surely it should only be querying the shards of today's index? There should be no more than a handful of these. I might understand it for a query spanning one month, but one day?
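
To put rough numbers on that puzzle, here is a minimal sketch of the shard arithmetic. It assumes the dashboard's index pattern is a wildcard matching every daily index, in which case a search fans out to one copy of each shard of every matching index regardless of the time range. The index counts below (60 days at five shards, 30 days at two) are hypothetical and not taken from our cluster stats.

```python
def shards_queried(shard_counts):
    """Total shards a search fans out to: every shard of every matched index."""
    return sum(shard_counts)


# Hypothetical history: 60 daily indexes created with 5 shards each,
# followed by 30 daily indexes created after the template change (2 shards).
daily_indexes = [5] * 60 + [2] * 30

# A wildcard pattern matches all 90 indexes, whatever the time filter says.
whole_pattern = shards_queried(daily_indexes)

# A pattern that resolves to today's index only touches just its shards.
today_only = shards_queried(daily_indexes[-1:])

print(whole_pattern, today_only)
```

Under those assumed counts the wildcard case lands in the hundreds of shards, the same order of magnitude as the 350 in the message, while the single-day case is the handful I expected.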

Our cluster stats are here: https://gist.github.com/jmkgreen/ef7aca74230434ea53ec2e3db4336b91

I ran a thread pool query (https://gist.github.com/jmkgreen/bf62d25ac1401ea559eaea0f2532a5ce), but from looking at past cases of this, the answer is not to increase the queue size but to understand and fix the underlying problem.
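
For anyone reading that kind of output, this is a small sketch of how the search rejections could be picked out of it. It assumes the text comes from `GET _cat/thread_pool?v&h=host,search.active,search.queue,search.rejected`; the helper name `rejected_search_nodes` and the sample rows are made up for illustration and are not the contents of the gist.

```python
def rejected_search_nodes(cat_output):
    """Return hosts whose search thread pool has rejected at least one request."""
    lines = [ln for ln in cat_output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    # Pair each whitespace-separated value with its column name.
    rows = [dict(zip(header, ln.split())) for ln in lines[1:]]
    return [r["host"] for r in rows if int(r["search.rejected"]) > 0]


# Made-up sample in the shape of the _cat output (hypothetical hosts and values).
SAMPLE = """
host   search.active search.queue search.rejected
node-1 3             0            0
node-2 13            1000         4821
"""

print(rejected_search_nodes(SAMPLE))
```

A non-zero `search.rejected` on a node means its search queue overflowed at some point, which is what surfaces in Kibana as failed shards.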

Can somebody take a quick look and advise on what we may have done wrong?
