Hi, everyone.
After a recent upgrade to Elastic Stack 5, my team has been encountering a strange issue: random node disconnections following a java.io.EOFException on the master node.
Our cluster consists of three nodes, each residing on a VMware VM with 16 vCPUs and 32 GB of RAM.
Each Elasticsearch instance has a 16 GB heap. We're currently averaging around 10k docs indexed per second, with occasional peaks of 15-20k docs/s.
The cluster holds 114 indices, 792 shards, 2,321,630,201 docs (excluding replicas), and 2.37 TB of data (including replicas).
Each node is running ES 5.1.1 (the disconnections started after upgrading from 2.4 to 5.0.2). The JVM version is the same on all nodes: Java HotSpot(TM) 64-Bit Server VM 25.112-b15 (Java 1.8.0_112-b15).
So here is the problem we've been struggling with: sometimes one of the three nodes loses connection to the master node and leaves the cluster for a couple of seconds, rejoining it afterwards without any issues.
Here are excerpts from the ES log (I've replaced the IP addresses with node names) on the master node and on the node that left the cluster:
http://pastebin.com/LMU3pjz5 (from the master node)
http://pastebin.com/Qb0ehrGi (from the node that left the cluster)
As you can see, it all starts with the master node hitting an EOFException while deserializing a response from the other node:
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse]
at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1278) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1250) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-5.1.1.jar:5.1.1]
...
Caused by: java.io.EOFException: tried to read: 91755306 bytes but only 114054 remaining
at org.elasticsearch.transport.netty4.ByteBufStreamInput.ensureCanReadBytes(ByteBufStreamInput.java:75) ~[?:?]
at org.elasticsearch.common.io.stream.FilterStreamInput.ensureCanReadBytes(FilterStreamInput.java:80) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.io.stream.StreamInput.readArraySize(StreamInput.java:892) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.io.stream.StreamInput.readString(StreamInput.java:334) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.index.Index.<init>(Index.java:64) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.index.shard.ShardId.readFrom(ShardId.java:101) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.index.shard.ShardId.readShardId(ShardId.java:95) ~[elasticsearch-5.1.1.jar:5.1.1]
...
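For context, here is a stripped-down sketch of what a length-prefixed read like StreamInput.readString() boils down to. This is purely illustrative, not the actual Elasticsearch source: the length header is read first, and if that header is garbled the reader asks for far more bytes than the frame actually contains, which produces exactly the EOFException above.

import java.io.DataInputStream;
import java.io.EOFException;
import java.nio.charset.StandardCharsets;

// Illustrative only -- a simplified stand-in for a length-prefixed string read,
// not the real Elasticsearch StreamInput/ByteBufStreamInput code.
public class LengthPrefixedReadSketch {

    static String readString(DataInputStream in, int bytesRemainingInFrame) throws Exception {
        // The length header comes first; if it is corrupted (e.g. 91755306),
        // it can claim far more data than the frame actually holds.
        int length = in.readInt();
        if (length > bytesRemainingInFrame) {
            // Roughly the check that ensureCanReadBytes() performs in the trace above.
            throw new EOFException("tried to read: " + length
                    + " bytes but only " + bytesRemainingInFrame + " remaining");
        }
        byte[] buf = new byte[length];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }
}

So either the response really was cut short on the wire, or the bytes being deserialized were not what the sender wrote, which is why I'm unsure whether to blame the network or something else.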
Then the connection is closed and the node leaves the cluster:
[2016-12-12T09:26:50,081][WARN ][o.e.t.n.Netty4Transport ] [elastic-fk-node01] exception caught on transport layer [[id: 0xcbdaf621, L:/elastic-fk-node01:35678 - R:elastic-fk-node02/elastic-fk-node02:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [79888769], handler
...
[2016-12-12T09:26:50,087][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elastic-fk-node01] failed to execute on node [_EO_pBi6S4iyWSyPgo9LbA]
org.elasticsearch.transport.NodeDisconnectedException: [elastic-fk-node02][elastic-fk-node02:9300][cluster:monitor/nodes/stats[n]] disconnected
I've noticed that after we disabled cluster monitoring (Zabbix + elasticbeat), the disconnections became less frequent; however, this cluster had no issues with the same monitoring system in place before the upgrade to 5.0.2. Right now the only thing querying the stats APIs is Cerebro (https://github.com/lmenezes/cerebro) v0.4.1.
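For reference, the "monitoring" load here is nothing exotic: it is essentially periodic GETs against the standard stats endpoints, something along these lines (a hypothetical sketch of the polling; only the endpoint paths are the real ES 5.x APIs, the loop itself is illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative only -- roughly the kind of periodic stats polling that
// Cerebro / elasticbeat perform against the cluster.
public class StatsPollerSketch {
    public static void main(String[] args) throws Exception {
        String[] endpoints = {
                "http://elastic-fk-node01:9200/_nodes/stats",
                "http://elastic-fk-node01:9200/_stats"
        };
        while (true) {
            for (String endpoint : endpoints) {
                HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
                conn.setRequestMethod("GET");
                int status = conn.getResponseCode();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(conn.getInputStream()))) {
                    long chars = reader.lines().mapToLong(String::length).sum();
                    System.out.println(endpoint + " -> HTTP " + status + ", " + chars + " chars");
                }
                conn.disconnect();
            }
            Thread.sleep(10_000); // poll every 10 seconds, similar to a monitoring interval
        }
    }
}

The stats responses on a cluster this size are large, which may be why the problem shows up on these requests first, but that is just a guess on my part.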
So the question is: is this purely a networking issue (we haven't been able to find any network problems so far, but our network engineers are still investigating), or am I missing something? I'd appreciate any tips on where to look next, because everything I've been able to find on this involves mismatched ES or Java versions, which is not the case here.