TCP backlog overload on 9200

Hello, I am seeing strange issues with my Elasticsearch cluster of 3 nodes.

I have 24G of indices, and ES_HEAP_SIZE=32G.

When the resident memory (RSS) of the Elasticsearch Java process grows to 32G, the TCP backlog on port 9200 overflows and I get "tcp connection timeout" on that socket.

In my logs I do not see any warnings, only:

[2016-10-24 15:07:23,348][INFO ][monitor.jvm ] [Node1] [gc][young][45420][19353] duration [839ms], collections [1]/[1.1s], total [839ms]/[1.7h], memory [9.3gb]->[8.2gb]/[31.8gb], all_pools {[young] [1.1gb]->[1.9mb]/[1.1gb]}{[survivor] [126.8mb]->[127.3mb]/[149.7mb]}{[old] [8gb]->[8.1gb]/[30.5gb]}

When I restart a node in the cluster, the problem goes away until RSS again reaches ES_HEAP_SIZE.


My config:

node.master: true
threadpool.search.queue_size: 5000
indices.store.throttle.type: none
discovery.zen.minimum_master_nodes: 2

I tried tuning threadpool.search.queue_size to 5000 (it was 1000) and indices.store.throttle.type to none (it was merge), but this doesn't help.

How did you find this? What does the error message look like? What OS is this?

Regarding the 5000 you set: do not set this to 5000, that is way too high. You are not tuning, you are making your system unusable.

Hello, thanks for the answer.
My OS is Debian GNU/Linux 8 with 3.16 kernel.

From the server side it looks like this:

I ran the Linux command 'ss', which shows the receive queue of the TCP stack:
ss -nlt | grep 9200
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 51 50 ::ffff: :::*

So the receive queue is full and Linux can't accept new connections. From the Linux man pages:

" This value determines the number of fully acknowledged (SYN -> SYN/ACK -> ACK) connections that are waiting to be accept()ed by the process. When requests are being processed quickly, this value should be 0."

And I guess Elasticsearch doesn't accept() connections fast enough and has some internal queue.
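This effect is easy to reproduce outside Elasticsearch. A minimal sketch (assuming a Linux host, where the kernel drops new SYNs once the accept queue is full) with a listener that never calls accept():

```python
import socket

# A listener that never calls accept() lets its backlog fill up.
# On Linux, once the queue is full the kernel drops new SYNs, so
# further connect() attempts hang and time out -- the same symptom
# clients of port 9200 observed.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))   # pick any free port
srv.listen(1)                # tiny backlog instead of the default 50
port = srv.getsockname()[1]

completed, failures = [], 0
for _ in range(8):
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.settimeout(0.5)
    try:
        c.connect(("127.0.0.1", port))
        completed.append(c)      # handshake done, waiting in backlog
    except OSError:              # socket.timeout is a subclass of OSError
        failures += 1
        c.close()

print("handshakes completed:", len(completed), "failed:", failures)
```

With a backlog of 1 only the first couple of connections complete; the rest time out even though the server process is alive, which matches what `ss` shows when Recv-Q reaches the backlog limit.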

Also, at this time the server log shows only INFO messages:

[2016-10-24 15:07:23,348][INFO ][monitor.jvm ] [Node1] [gc][young][45420][19353] duration [839ms], collections [1]/[1.1s], total [839ms]/[1.7h], memory [9.3gb]->[8.2gb]/[31.8gb], all_pools {[young] [1.1gb]->[1.9mb]/[1.1gb]}{[survivor] [126.8mb]->[127.3mb]/[149.7mb]}{[old] [8gb]->[8.1gb]/[30.5gb]}

From client:

Failed to connect to x.x.x.x port 9200: Connection refused

or Connection Timeout

Sometimes from the client side I see the error

"Uncaught exception:"Elastica\Exception\ResponseException" message:"[reduce] ""

Your diagnosis is not correct. If ss or netstat report a positive count in Recv-Q, it means the operating system has problems and you must solve them at the operating-system layer.

If Elasticsearch could not process incoming requests fast enough, you would see exceptions and error messages in the Elasticsearch log.

You also receive "connection refused". That is a strong hint that something in your network setup is not correct, because the port is not open. I suggest checking your firewall/network filter settings.

This is an unrelated issue.

Thanks, but I don't think this is at the operating-system layer. This queue can overflow in any application (Apache, Nginx) when the application doesn't call accept() on pending connections fast enough.

Sometimes I see errors like this in the Elasticsearch logs:

[Node1] [12388984] Failed to execute fetch phase
RemoteTransportException[[Node10][][indices:data/read/search[phase/fetch/id]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@5fac576d on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@7a50e2d1[Running, pool size = 37, active threads = 37, queued tasks = 1000, completed tasks = 24709461]]];
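The rejection message itself shows which limit was hit. A small sketch that just parses the numbers out of the text quoted above:

```python
import re

# Pull the pool statistics out of the EsRejectedExecutionException
# message: the search pool rejects work only when every thread is
# busy AND the queue is at capacity.
msg = ("EsThreadPoolExecutor[search, queue capacity = 1000, "
       "...[Running, pool size = 37, active threads = 37, "
       "queued tasks = 1000, completed tasks = 24709461]]")

stats = {k: int(v) for k, v in re.findall(
    r"(queue capacity|pool size|active threads|queued tasks) = (\d+)", msg)}
print(stats)

saturated = (stats["active threads"] == stats["pool size"]
             and stats["queued tasks"] == stats["queue capacity"])
print("search pool saturated:", saturated)
```

Here all 37 search threads are busy and all 1000 queue slots are taken, so new search requests are rejected regardless of how much free RAM or CPU the box has.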

That means ES is overloaded, you need to increase the resources it has available.

But I have enough system resources; at the same time there is 30G of free memory, 70% CPU idle, and 1-2% disk (SSD) utilization.
So I guess Elasticsearch doesn't use system resources optimally and has some internal limits.
I asked about this hoping to get some tuning recommendations from Elastic.
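One way to watch which internal limit is being hit, rather than inferring it from free RAM/CPU, is the `_cat/thread_pool` API. A rough sketch, assuming a node reachable on localhost:9200 (the exact columns shown with `?v` vary between ES versions):

```python
import urllib.request

# Poll the _cat/thread_pool API to see per-pool activity and
# rejections. The endpoint exists in ES 2.x; this just dumps
# whatever text the node returns.
URL = "http://localhost:9200/_cat/thread_pool?v"

def thread_pool_snapshot(url=URL, timeout=2.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", "replace")
    except OSError as exc:          # no cluster in reach, DNS, timeout...
        return "no cluster reachable: %s" % exc

print(thread_pool_snapshot())
```

A steadily growing rejected count for the search pool would confirm that the bottleneck is the thread-pool/queue configuration rather than the machine itself.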

So as I understand it, these are internal Elasticsearch limitations, and there is no sense in running Elasticsearch on a powerful node with more than 8 CPUs and more than 32G of memory. Out of the box, Elasticsearch can't use system resources optimally and refuses connections even when there is no high load on the server.