Client nodes stop responding

We are running into an issue where our client nodes will stop responding to
requests that require checking with other nodes. The following is our setup:

1 dedicated master node

2 dedicated data nodes

3 client nodes (master: false; data: false)

Everything is fine and happy for a while, and then after 30-45 minutes, one
of the client nodes will stop sending responses to queries that require
talking with other nodes. We are using the HTTP REST API. When things go
badly, the following will hang:

curl -XGET 'http://localhost:9200/_search?size=1'

curl -XGET 'http://localhost:9200/_cat/thread_pool?v'

But the following will succeed (as it can just use metadata on the node):

curl -XGET 'http://localhost:9200/_cluster/health?pretty=1'
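To make the difference between the hanging and the working requests easier to reproduce, the three calls above can be wrapped with a hard timeout so a hang is reported rather than blocking the shell. This is only a rough sketch: the HOST and TIMEOUT defaults and the function names classify_exit and probe are made up for illustration, and it relies on curl exiting with status 28 when --max-time expires.

```shell
#!/usr/bin/env bash
# Probe a few endpoints with a hard timeout so a hung request is reported
# instead of blocking the terminal.
# Assumptions: HOST and TIMEOUT are illustrative defaults, not from the post.
HOST="${HOST:-http://localhost:9200}"
TIMEOUT="${TIMEOUT:-5}"

# Map a curl exit status to a short label. curl exits with 28 when
# --max-time expires, which is how a hang shows up here.
classify_exit() {
  case "$1" in
    0)  echo "ok" ;;
    28) echo "timeout" ;;
    *)  echo "error($1)" ;;
  esac
}

# Issue one GET and print the path together with the classified result.
probe() {
  local path="$1" rc=0
  curl -s -o /dev/null --max-time "$TIMEOUT" -XGET "${HOST}${path}" || rc=$?
  printf '%-30s %s\n' "$path" "$(classify_exit "$rc")"
}

for path in '/_search?size=1' '/_cat/thread_pool?v' '/_cluster/health?pretty=1'; do
  probe "$path"
done
```

On a client node in the bad state, the first two paths would be expected to report a timeout while the health call reports ok.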

The problem node doesn’t seem to have any CPU or IO load. We don’t seem to
be running into heap issues. netstat doesn’t report any connections in
TIME_WAIT on any of the nodes. If we run queries from the problem client
node at the command prompt directly at the data node, everything works. So,
if we instead run:

curl -XGET 'http://data.node.ip:9200/_search?size=1'

It works as expected. This tells me there isn’t a socket exhaustion issue
since we can make new connections from the problem node to other nodes.
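One caveat: the curl test above goes over the HTTP port (9200), while node-to-node traffic uses the transport port, which is 9300 by default. A quick way to look at those connections specifically might be something like the following. It is a sketch that assumes the default transport port and the Linux netstat column layout; summarize_states is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Count TCP connection states for connections whose remote port matches.
# Reads `netstat -tan` style output (Linux column layout) on stdin:
# column 5 is the foreign address, column 6 is the state.
summarize_states() {
  local port="$1"
  awk -v suffix=":${port}" '
    $1 ~ /^tcp/ && substr($5, length($5) - length(suffix) + 1) == suffix { count[$6]++ }
    END { for (s in count) print s, count[s] }
  ' | sort
}

# 9300 is the Elasticsearch default transport port; adjust if the cluster
# overrides it (assumption: this cluster uses the default).
if command -v netstat >/dev/null 2>&1; then
  netstat -tan | summarize_states 9300
fi
```

Running this on the problem client node before and after it stops responding would show whether the set of transport connections changes.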

We turned logging all the way up (“ALL”) on one of the client nodes until it
started failing, but there was nothing in there of interest. The last few
minutes just had messages about the idle connection reaper running every

We tried increasing the various connections_per_node values to:

transport.connections_per_node.bulk => 6

transport.connections_per_node.reg => 12

transport.connections_per_node.state => 2

This made no noticeable difference.

When one of the client nodes has started having problems, the cluster still
sees the node as part of the cluster. When we kill the ES process on that
node, all the other nodes then notice it went away as expected. When we
restart ES on the problem node, it comes back up and everything works great
for another 30-45 minutes.

