Thread pool and channels exploding


I currently have a cluster of ten nodes that also act as web front ends. There is one shard, replicated to every node with `auto_expand_replicas` set to `"0-all"`. All searches are made locally, from each front end to the node on the same server, using the `_only_local` preference.
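For reference, the replication part of that setup boils down to a setting like the following (excerpt only, the rest of my config is standard):

```yaml
# elasticsearch.yml excerpt: keep a copy of every shard on every node,
# from zero replicas (single node) up to one replica per node
index.auto_expand_replicas: "0-all"
```

and every search request carries `?preference=_only_local` in its query string, so it is served by the local shard copy or fails.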

When used in production, everything is fine for the first 30-60 minutes. Up to 50 searches per second are handled on each node. Then, for no apparent reason, the thread pool and channels explode at the same time, and queries become very slow (many seconds) because they sit in the queue before being executed. Here are BigDesk graphs showing this:

I can't figure out what happens at that particular moment that makes queries wait in the queue. Any ideas?
I'm using Elasticsearch 1.5.0. Configuration file `/etc/default/elasticsearch`:

Thanks for your help :blush:

I'm unsure what's happening with yours, but it's likely that some percentage of your queries are "slow" ones, which causes search requests to start queuing faster than before. I'd enable the slowlog and see what it finds.
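You can turn the search slowlog on dynamically across all indices; on 1.x, something like this should work (the thresholds below are just examples, tune them to your expected latencies):

```
curl -XPUT 'localhost:9200/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.threshold.query.info": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'
```

Note that these thresholds measure execution time on the shard, so set them well below the latencies you are seeing.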

But it does look like your client is misbehaving. The Search Count metric in BigDesk is the cumulative number of completed search threads (IIRC). You can see that it is fairly steady until 10:30, then skyrockets. That means your clients are suddenly sending many more requests. At the same time, the number of open HTTP sockets jumps to 200.

To me, this looks like your system is either spinning up many more connections/clients, or rapidly re-sending queries, and/or not using persistent keep-alive connections.

It's only when the search rate increases drastically that you start to see the queue build. So I'd investigate what's happening with your client; there may be something odd there. Or perhaps your client is receiving queue-full rejection exceptions and immediately retrying instead of backing off.

In general, I would not advise a fully replicated setup. If you fully replicate and use preference `_local`, your latency increases linearly with the number of documents, because each request is serviced by a single machine instead of being spread across shards. It is also easy to make one node "hot" if your requests are not evenly distributed. And it vastly slows down indexing, since the same document has to be indexed on every node.

Hi Zachary,

Thanks for your answer.
I'm already logging slow queries (> 10s) and none have been logged.

I don't have the number of requests sent by the client, but I have a cache in front of Elasticsearch with a one-hour lifetime, and our problem happens less than one hour after going into production. What is very odd is that at the beginning the cache is empty, so Elasticsearch receives lots and lots of requests, and it has no problem. The problem only starts once it (apparently) has much less work. I will add a tracker to measure how many requests the client is really sending.

There is also one thing I forgot to mention: I upgraded to Java 8 because I had problems with GC and non-heap memory, and I don't remember having this problem with Java 7. Could the Java version have an impact on this behaviour?

In fact I'm using the `_local` preference and fully replicated nodes to avoid network latency, because the servers aren't all in the same datacenter. I understand now why it's not really a good idea.

I'll add some trackers on the client (requests and exceptions at least) and let you know if I find something :wink:

Hm, ok. I'm not sure, let me know if you dig up more details.

Upgrading to Java 8 shouldn't cause a problem. Did you change the default GC, though? We recommend CMS, even on Java 8, because G1 is still buggy; Oracle is still hammering segfaults out of it.
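For reference, the stock CMS flags that ship in `elasticsearch.in.sh` on 1.x look roughly like this; make sure no `-XX:+UseG1GC` override in your startup scripts or `JAVA_OPTS` is replacing them:

```
# Default GC configuration shipped with Elasticsearch 1.x
JAVA_OPTS="$JAVA_OPTS -XX:+UseParNewGC"
JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"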

One note: make sure your preference is exactly `_local` if you are using it. If you use a non-special preference such as `local` (note the missing underscore), ES will hash that value and use it as a routing value, which will pin all your traffic to one node.
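Conceptually, a non-special preference string behaves like this (an illustration only, not Elasticsearch's actual hash function):

```python
import zlib

def pick_copy(preference: str, num_copies: int) -> int:
    """Sketch of how a custom preference string selects a shard copy:
    the string is hashed, and the hash deterministically picks one copy,
    so every request carrying the same string lands on the same node."""
    return zlib.crc32(preference.encode("utf-8")) % num_copies

# The string "local" (no underscore) maps to the same copy every time,
# pinning all traffic to one node:
copies = {pick_copy("local", 10) for _ in range(1000)}
assert len(copies) == 1
```

With the special `_local` value, by contrast, ES short-circuits the hash and prefers whatever copy lives on the node that received the request.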

In fact I tried both G1 and CMS and got the same result with both.
I also tried the `_local` and `_only_local` preferences, with the same result. When I don't set a preference at all, everything stays fine for longer, around 3 hours, but then it becomes very slow too.

I should now try to stop using full replication, as you advised, but I have no idea how many shards and replicas I should have.
I have 6 indices; the biggest one holds 1.6M documents (1.64 GiB), which is not that big. The other indices are 300k documents max. I currently have 15 nodes in my cluster, and I will soon split them into 3 clusters of 5 nodes each (because they are in different datacenters).
Could you advise me on the shard and replica setup? Or should I keep the default configuration?

Thanks again for your help