Random node disconnects - java.io.IOException: Connection timed out

EDIT 3:
Solution TL;DR :smile: :
I finally found a way to solve this.
This blog post describes it very well.
My Linux kernel has a bug in the network scatter/gather functionality. Turning scatter/gather off solves the issue completely!
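For reference, scatter/gather offloading can be checked and turned off with ethtool. This is only a sketch - the interface name eth0 is an assumption, and the change does not survive a reboot:

# Show the current offload settings and look for "scatter-gather"
ethtool -k eth0 | grep -i scatter
# Turn scatter/gather offloading off
ethtool -K eth0 sg off
# Verify that it is now reported as "off"
ethtool -k eth0 | grep -i scatter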


EDIT 2:
First try - this was not the problem!!
Our assumption was that the OS was closing connections that were still needed, so we tuned the TCP keepalive settings:
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=3
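To keep these values across reboots they can also go into a sysctl config file - a minimal sketch, assuming /etc/sysctl.conf (some distros use /etc/sysctl.d/ instead):

# Persist the keepalive settings (path is distro-dependent)
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3
EOF
# Reload the settings from the file
sysctl -p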

found at

We just recently started a new cluster on new hardware (bare metal and virtual).
We have experience running a big cluster (1.5 TB) for over a year.

The new cluster has huge connection issues.
It disconnects random nodes every 0-4 hours.
The cluster then goes yellow before turning green again within a few minutes.
The node is reconnected to the cluster within one minute or less.
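For reference, the yellow/green transitions can be watched from the shell like this (host and port are assumptions):

# Poll cluster status, node count and shard state every 5 seconds
watch -n 5 "curl -s 'localhost:9200/_cluster/health?pretty'"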

Some information about the cluster:
4 virtual servers:
8 CPUs
32 GB RAM

2 bare metal servers:
20 CPUs
64 GB RAM

3 master nodes, 3 search nodes, 6 data nodes
Every server runs one data node plus either a search node or a master node.

130 GB of data, growing rapidly by about 25 GB per day
400 indices with 5 primary shards and 1 replica each -> 4,000 shards (400 x 5 x 2)
400,000,000 docs

Our old cluster was way bigger, ran on much smaller hardware, and worked fine.

Please ask if you need any more information.

This is the first exception that shows up with transport logging set to TRACE (a sketch for enabling that is further down):

close connection exception caught on transport layer [[id: 0x6ce6ab98, /10.134.6.44:47889 => /10.134.6.38:9301]], disconnecting from relevant node
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

This one also occurs:
[master] stopping fault detection against master [[prod00elastic10-master][8PSr9xVZQS2ysVplwMjBLw][prod00elastic10][inet[/10.134.6.44:9302]]{data=false, master=true}], reason [master failure, do not exists on master, act as master failure]
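For anyone who wants to capture the same output: transport TRACE logging can be switched on at runtime. A sketch, assuming the dynamic logger settings of the cluster settings API (host and port are assumptions as well):

# Enable transport TRACE logging without a restart
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "logger.transport": "TRACE" }
}'
# Switch back to INFO once enough has been captured
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "logger.transport": "INFO" }
}'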


To possibly help out other people having the same problem, I will document my progress here.

I am currently following two things:

as well as

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811