Courier Fetch: X of Y shards failed

Back when I was using Elasticsearch 1.5, I experienced the following error in Kibana 4.2:

Courier Fetch: X of Y shards failed

To compensate for this, I inserted the following line into /etc/elasticsearch/elasticsearch.yml:

# Allows for unbounded queue for reads
threadpool.search.type: cached

I don't thoroughly understand the setting, but my impression is that it forces the Elasticsearch query (from Kibana) to wait for a response rather than timing out. Performance isn't critical to me, so that's just fine.

I recently upgraded to Elasticsearch 2.1. I received the same "Courier Fetch: X of Y shards failed" error, so I tried to use the same setting. Unfortunately, I got an error this time. Has this setting been removed from Elasticsearch?

-Chad

The error that I receive is:

setting threadpool.search.type to cached is not permitted; must be fixed

I looked in /var/log/elasticsearch/.log and found:

Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@100f9843 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@3a3dd1dd[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 9343]]]
RemoteTransportException[[Robert Bruce Banner][127.0.0.1:9300][indices:data/read/search[phase/query]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@4ce31bbe on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@3a3dd1dd[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 9343]]];

The key part seems to be:

search, queue capacity = 1000

So I took a guess, and this seems to solve my problem.

I added the following to /etc/elasticsearch/elasticsearch.yml:

threadpool.search.queue_size: 2000

I'm not entirely sure what this does...

First, this is addressing the symptom, not the underlying problem. It's important to note here that your cluster was under duress and reporting EsRejectedExecutionExceptions; buried in those exception messages was the fact that Elasticsearch's search queue was stuffed. What you did originally was change the thread pool type for the search thread pool from fixed (a fixed number of workers, bounded work queue) to cached (an unbounded number of workers, unbounded work queue). The thread pool types are covered in the documentation.
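
If you want to confirm what each pool is currently configured as, the node info API exposes the thread pool settings. Here is a minimal sketch in Python (an illustration of mine, not part of the original posts), assuming a node reachable at localhost:9200 and the requests library; field names can differ slightly between versions:

import requests
# Ask every node for its thread pool configuration (type, size, queue size).
info = requests.get("http://localhost:9200/_nodes/thread_pool").json()
for node_id, node in info["nodes"].items():
    # Print just the search pool; other pools (index, bulk, ...) are listed too.
    print(node.get("name", node_id), node["thread_pool"]["search"])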

This change is incredibly dangerous. If your node is having trouble keeping up with work, allowing more work in is not the solution. The EsRejectedExecutionException is a backpressure mechanism desperately trying to signal to the clients: stop, I cannot keep up! The fervent hope is that the clients would apply some kind of backoff-and-retry mechanism until the work queue drains. Clients could either back off and retry exponentially, or introspect Elasticsearch and check the number of tasks in the work queue before sending the rejected request again.
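
To make the first option concrete, here is a rough client-side sketch in Python (my own illustration, assuming the requests library and a hypothetical index name, not code from either post); it backs off exponentially whenever a search is rejected or comes back with failed shards:

import time
import requests
def search_with_backoff(url, query, max_retries=5):
    # Rejections can show up as a non-200 status or as a 200 response that
    # reports failed shards, so treat both as a signal to back off and retry.
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.post(url, json=query)
        if resp.status_code == 200:
            body = resp.json()
            if body.get("_shards", {}).get("failed", 0) == 0:
                return body  # complete result, nothing rejected
        time.sleep(delay)  # give the work queue time to drain
        delay *= 2
    raise RuntimeError("search still failing after %d retries" % max_retries)
# Example usage (hypothetical index pattern):
# search_with_backoff("http://localhost:9200/logstash-*/_search",
#                     {"query": {"match_all": {}}})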

In fact, the cached thread pool type is so dangerous that in Elasticsearch 2.1.0, Elasticsearch stopped allowing thread pools to be set to type cached. That is why you saw this message:

setting threadpool.search.type to cached is not permitted; must be fixed

An unbounded thread pool is reserved for extremely special circumstances, namely requests that absolutely must be served immediately lest Elasticsearch block.

Well, that's not the whole truth. Elasticsearch as of 2.1.0 actually forbids changing thread pool types at all. The reasoning is that changing the thread pool type is very risky with extremely little real-world benefit; it was deemed not worth the cost to users and the complexity to Elasticsearch to allow changing thread pool types.

Again, you have addressed the symptom and not the problem. It is possible that your workload constantly hovers around 2000 search tasks. But maybe something else is going on? Maybe your cluster is undersized? Maybe your nodes are undersized? Maybe something is wrong with your clients and they are sending too many requests?

Increasing the queue size without discerning the cause of the stuffed queue, and without considering the effects of a larger queue, could just be postponing the day of reckoning. You most definitely do not want to keep increasing the queue size without finding out why your queues are so constantly stuffed, and whether or not your cluster can handle the workload that it is currently under.
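
One way to start answering those questions is to watch the search thread pool over time rather than only after a failure. Here is a small sketch in Python (again just an illustration, assuming the requests library and localhost:9200) that samples the queue depth and the rejected counter on every node:

import time
import requests
# Sample the search pool every 10 seconds so you can see whether the queue
# is chronically full or only spikes under certain workloads.
while True:
    stats = requests.get("http://localhost:9200/_nodes/stats/thread_pool").json()
    for node_id, node in stats["nodes"].items():
        search = node["thread_pool"]["search"]
        print(node.get("name", node_id), "active=%s queue=%s rejected=%s"
              % (search["active"], search["queue"], search["rejected"]))
    time.sleep(10)

A steadily climbing rejected count, or a queue that never drains, points at an undersized cluster or overly aggressive clients rather than at a queue that is merely too small.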

I hope that helps. Happy to help more if the need arises. 🙂
