TCP backlog overload on 9200

Hello, I am seeing strange issues with my Elasticsearch cluster of 3 nodes.

I have 24G of indices, and ES_HEAP_SIZE=32G.

When the resident memory (RSS) of the Elasticsearch Java process grows to 32G, the TCP backlog on port 9200 overflows and I get "tcp connection timeout" on that socket.

In my logs I do not see any warnings, only:

[2016-10-24 15:07:23,348][INFO ][monitor.jvm ] [Node1] [gc][young][45420][19353] duration [839ms], collections [1]/[1.1s], total [839ms]/[1.7h], memory [9.3gb]->[8.2gb]/[31.8gb], all_pools {[young] [1.1gb]->[1.9mb]/[1.1gb]}{[survivor] [126.8mb]->[127.3mb]/[149.7mb]}{[old] [8gb]->[8.1gb]/[30.5gb]}

When I restart a node in the cluster, the problem goes away until RSS again reaches ES_HEAP_SIZE.


My config:

node.master: true
threadpool.search.queue_size: 5000
indices.store.throttle.type: none
discovery.zen.minimum_master_nodes: 2

I tried tuning threadpool.search.queue_size to 5000 (it was 1000) and indices.store.throttle.type to none (it was merge), but this doesn't help.

How did you find this? What does the error message look like? What OS is this?

Regarding the 5000 you set: do not set this to 5000, that is way too high. You are not tuning, you are making your system unusable.

Hello, thanks for the answer.
My OS is Debian GNU/Linux 8 with 3.16 kernel.

From the server side it looks like this:

I ran the Linux command 'ss', which shows the receive queue of the TCP stack:
ss -nlt | grep 9200
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 51 50 ::ffff: :::*

So the receive queue is full and Linux can't accept new connections. From the Linux man pages:

" This value determines the number of fully acknowledged (SYN -> SYN/ACK -> ACK) connections that are waiting to be accept()ed by the process. When requests are being processed quickly, this value should be 0."

And I guess Elasticsearch doesn't accept() connections fast enough and has some internal queue.
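This effect is easy to reproduce outside Elasticsearch. A minimal sketch (assuming a Linux host, where the kernel drops new SYNs once the accept queue is full) with a listener that never calls accept():

```python
import socket

# A listener that never calls accept() lets its backlog fill up.
# On Linux, once the queue is full the kernel drops new SYNs, so
# further connect() attempts hang and time out -- the same symptom
# clients of port 9200 observed.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))   # pick any free port
srv.listen(1)                # tiny backlog instead of the default 50
port = srv.getsockname()[1]

completed, failures = [], 0
for _ in range(8):
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.settimeout(0.5)
    try:
        c.connect(("127.0.0.1", port))
        completed.append(c)      # handshake done, waiting in backlog
    except OSError:              # socket.timeout is a subclass of OSError
        failures += 1
        c.close()

print("handshakes completed:", len(completed), "failed:", failures)
```

With a backlog of 1 only the first couple of connections complete; the rest time out even though the server process is alive, which matches what `ss` shows when Recv-Q reaches the backlog limit.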

Also, at this time the server log shows only INFO messages:

[2016-10-24 15:07:23,348][INFO ][monitor.jvm ] [Node1] [gc][young][45420][19353] duration [839ms], collections [1]/[1.1s], total [839ms]/[1.7h], memory [9.3gb]->[8.2gb]/[31.8gb], all_pools {[young] [1.1gb]->[1.9mb]/[1.1gb]}{[survivor] [126.8mb]->[127.3mb]/[149.7mb]}{[old] [8gb]->[8.1gb]/[30.5gb]}

From client:

Failed to connect to x.x.x.x port 9200: Connection refused

or Connection Timeout

Sometimes from the client side I see the error

"Uncaught exception:"Elastica\Exception\ResponseException" message:"[reduce] ""

Your diagnosis is not correct. If ss or netstat report a positive count in Recv-Q, it means the operating system has problems and you must solve them at the operating-system layer.

If Elasticsearch could not process incoming requests fast enough, you would see exceptions and error messages in the Elasticsearch log.

You also receive "connection refused". That is a strong hint that something in your network setup is not correct, because the port is not open. I suggest checking your firewall/network filter settings.

This is an unrelated issue.

Thanks, but I don't think this is at the operating-system layer. This queue can overflow in any application (Apache, Nginx) when the application doesn't call accept() on pending connections fast enough.

Sometimes I see errors like this in the Elasticsearch logs:

[Node1] [12388984] Failed to execute fetch phase
RemoteTransportException[[Node10][][indices:data/read/search[phase/fetch/id]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@5fac576d on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@7a50e2d1[Running, pool size = 37, active threads = 37, queued tasks = 1000, completed tasks = 24709461]]];
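The rejection message itself shows which limit was hit. A small sketch that just parses the numbers out of the text quoted above:

```python
import re

# Pull the pool statistics out of the EsRejectedExecutionException
# message: the search pool rejects work only when every thread is
# busy AND the queue is at capacity.
msg = ("EsThreadPoolExecutor[search, queue capacity = 1000, "
       "...[Running, pool size = 37, active threads = 37, "
       "queued tasks = 1000, completed tasks = 24709461]]")

stats = {k: int(v) for k, v in re.findall(
    r"(queue capacity|pool size|active threads|queued tasks) = (\d+)", msg)}
print(stats)

saturated = (stats["active threads"] == stats["pool size"]
             and stats["queued tasks"] == stats["queue capacity"])
print("search pool saturated:", saturated)
```

Here all 37 search threads are busy and all 1000 queue slots are taken, so new search requests are rejected regardless of how much free RAM or CPU the box has.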

That means ES is overloaded, you need to increase the resources it has available.

But I have enough system resources; at the same time there is 30G of free memory, 70% CPU idle, and 1-2% disk (SSD) utilization.
So I guess Elasticsearch doesn't use system resources optimally and has some internal limits.
I asked about this hoping to get some tuning recommendations from Elastic.
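One way to watch which internal limit is being hit, rather than inferring it from free RAM/CPU, is the `_cat/thread_pool` API. A rough sketch, assuming a node reachable on localhost:9200 (the exact columns shown with `?v` vary between ES versions):

```python
import urllib.request

# Poll the _cat/thread_pool API to see per-pool activity and
# rejections. The endpoint exists in ES 2.x; this just dumps
# whatever text the node returns.
URL = "http://localhost:9200/_cat/thread_pool?v"

def thread_pool_snapshot(url=URL, timeout=2.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", "replace")
    except OSError as exc:          # no cluster in reach, DNS, timeout...
        return "no cluster reachable: %s" % exc

print(thread_pool_snapshot())
```

A steadily growing rejected count for the search pool would confirm that the bottleneck is the thread-pool/queue configuration rather than the machine itself.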

So as I understand it, these are internal Elasticsearch limitations, and there is no sense in running Elasticsearch on a powerful node with more than 8 CPUs and more than 32G of memory. Out of the box, Elasticsearch can't use system resources optimally and refuses connections even when there is no high load on the server.