Random node disconnects - java.io.IOException: Connection timed out

lwintergerst · October 28, 2015, 1:24pm

EDIT 3:
Solution TL;DR:
I finally found a way to solve this.
This blog post describes it very well.
My Linux kernel has a bug in the network scatter/gather functionality. Turning this off solves the issue completely!
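
The post does not quote the exact command. On Linux, offload features such as scatter/gather are usually toggled with ethtool, so a minimal sketch could look like the following. The interface name eth0 is a placeholder (check `ip link` for the real one), and the APPLY switch is a hypothetical safety guard so that the script only prints the command by default:

```shell
#!/bin/sh
# Interface name is an assumption; override with IFACE=... if yours differs.
IFACE="${IFACE:-eth0}"

# ethtool -K changes offload settings; "sg" is the scatter/gather feature.
CMD="ethtool -K $IFACE sg off"

if [ "${APPLY:-0}" = "1" ]; then
    # Needs root. The change is lost on reboot unless it is also added to
    # the distribution's network configuration.
    $CMD
else
    # Dry run (default): only show what would be executed.
    echo "$CMD"
fi
```

Running `ethtool -k eth0` (lowercase k) lists the current offload settings, so the scatter-gather line can be checked before and after the change.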

EDIT 2:
First try. This was not the problem!
The idea was that the OS was closing connections that were still needed.
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=3

found at
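
For reference, these three settings together bound how long a dead, idle connection can linger before the kernel drops it. The first probe goes out after tcp_keepalive_time seconds of idleness, and the connection is closed once tcp_keepalive_probes probes, sent tcp_keepalive_intvl seconds apart, go unanswered. This only applies to sockets that have SO_KEEPALIVE enabled. A quick sketch of that arithmetic with the values above (the variable names are just for illustration):

```shell
#!/bin/sh
# Values taken from the sysctl settings above.
KEEPALIVE_TIME=600    # seconds of idleness before the first probe
KEEPALIVE_INTVL=60    # seconds between probes
KEEPALIVE_PROBES=3    # unanswered probes before the connection is dropped

# Worst-case time until a dead peer is noticed on an idle connection.
DETECT_SECONDS=$((KEEPALIVE_TIME + KEEPALIVE_INTVL * KEEPALIVE_PROBES))
echo "dead peer detected after about ${DETECT_SECONDS}s"
```

With these values the result is 780 seconds, i.e. 13 minutes.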

We recently started a new cluster on new hardware (bare metal and virtual).
We have experience running a big cluster (1.5 TB) for over a year.

The new cluster has huge connection issues.
It disconnects random nodes every 0-4 hours.
The cluster then turns yellow before going green again within a few minutes.
The disconnected node rejoins the cluster within one minute or less.

Some information about the cluster:
4 virtual servers:
8 CPU
32GB RAM

2 bare metal:
20 CPU
64GB RAM

3 master nodes, 3 search nodes, 6 data nodes
Every server runs one data node plus either a search node or a master node

130GB of data, growing rapidly by about 25GB per day
400 indices, 5 shards 1 replica -> 4000 shards
400,000,000 docs
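
The shard total follows from the numbers above: each index has 5 primaries, and 1 replica doubles that. A small sketch of the arithmetic, which also estimates the shards per data node under the assumption that they are spread evenly across the 6 data nodes (variable names are illustrative):

```shell
#!/bin/sh
INDICES=400
PRIMARIES_PER_INDEX=5
REPLICAS=1
DATA_NODES=6

# Every primary shard has REPLICAS copies, so each index holds
# PRIMARIES_PER_INDEX * (1 + REPLICAS) shards in total.
TOTAL_SHARDS=$((INDICES * PRIMARIES_PER_INDEX * (1 + REPLICAS)))

# Integer division; assumes an even spread across data nodes.
SHARDS_PER_DATA_NODE=$((TOTAL_SHARDS / DATA_NODES))

echo "total shards: $TOTAL_SHARDS"
echo "shards per data node (rounded down): $SHARDS_PER_DATA_NODE"
```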

Our old cluster was much bigger, ran on much smaller hardware, and worked fine.

Please ask if you need any more information.

This is the first exception that shows up after setting transport logging to TRACE:

close connection exception caught on transport layer [[id: 0x6ce6ab98, /10.134.6.44:47889 => /10.134.6.38:9301]], disconnecting from relevant node
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

This one also occurs:
[master] stopping fault detection against master [[prod00elastic10-master][8PSr9xVZQS2ysVplwMjBLw][prod00elastic10][inet[/10.134.6.44:9302]]{data=false, master=true}], reason [master failure, do not exists on master, act as master failure]

lwintergerst · October 28, 2015, 2:38pm

To possibly help out other people having the same problem, I will document my progress here.

I am currently following two things:

as well as

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811
