Random exceptions on transport layer and subsequent node disconnections

Hi, everyone.

After a recent upgrade to Elastic Stack 5, my team has been encountering a strange issue: random node disconnections following a java.io.EOFException on the master node.

Our cluster consists of three nodes, each running on a VMware VM with 16 vCPUs and 32 GB of RAM.
Each Elasticsearch instance has a 16 GB heap. We're currently averaging about 10k docs indexed per second, with occasional peaks of 15-20k docs/s.
The cluster holds 114 indices, 792 shards, 2,321,630,201 docs (excluding replicas) and 2.37 TB of data (including replicas).
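
For anyone who wants to compare numbers, these figures can be pulled with the standard _cat APIs; a quick sketch (the hostname and HTTP port 9200 here are illustrative):

# Pull the same figures via the _cat APIs (hostname and port are illustrative):
curl -s 'http://elastic-fk-node01:9200/_cat/health?v'       # cluster status, node and shard counts
curl -s 'http://elastic-fk-node01:9200/_cat/indices?v'      # per-index doc counts and store size
curl -s 'http://elastic-fk-node01:9200/_cat/allocation?v'   # data volume and disk usage per node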

Each node runs ES 5.1.1 (the disconnections started after upgrading from 2.4 to 5.0.2). The JVM is the same on all nodes: Java HotSpot(TM) 64-Bit Server VM 25.112-b15 (Java 1.8.0_112-b15).

So here is the problem we've been struggling with: occasionally one of the three nodes loses its connection to the master, leaves the cluster for a couple of seconds, and then rejoins without any issues.

Here are excerpts from the ES logs on the master node and on the node that left the cluster (I've replaced the IP addresses with node names):

http://pastebin.com/LMU3pjz5 (from the master node)
http://pastebin.com/Qb0ehrGi (from the node that left the cluster)

As you can see, it all starts with the master node getting an EOFException while reading a response from the other node:

org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse]
        at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1278) [elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1250) [elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-5.1.1.jar:5.1.1]
...
Caused by: java.io.EOFException: tried to read: 91755306 bytes but only 114054 remaining
        at org.elasticsearch.transport.netty4.ByteBufStreamInput.ensureCanReadBytes(ByteBufStreamInput.java:75) ~[?:?]
        at org.elasticsearch.common.io.stream.FilterStreamInput.ensureCanReadBytes(FilterStreamInput.java:80) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.io.stream.StreamInput.readArraySize(StreamInput.java:892) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.io.stream.StreamInput.readString(StreamInput.java:334) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.index.Index.<init>(Index.java:64) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.index.shard.ShardId.readFrom(ShardId.java:101) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.index.shard.ShardId.readShardId(ShardId.java:95) ~[elasticsearch-5.1.1.jar:5.1.1]
...

Then the connection is closed and the node leaves the cluster:

[2016-12-12T09:26:50,081][WARN ][o.e.t.n.Netty4Transport  ] [elastic-fk-node01] exception caught on transport layer [[id: 0xcbdaf621, L:/elastic-fk-node01:35678 - R:elastic-fk-node02/elastic-fk-node02:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [79888769], handler
...
[2016-12-12T09:26:50,087][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elastic-fk-node01] failed to execute on node [_EO_pBi6S4iyWSyPgo9LbA]
org.elasticsearch.transport.NodeDisconnectedException: [elastic-fk-node02][elastic-fk-node02:9300][cluster:monitor/nodes/stats[n]] disconnected

I've noticed that the disconnections became less frequent after we disabled cluster monitoring (Zabbix + elasticbeat), but this cluster had no issues with the same monitoring in place before the 5.0.2 upgrade. Right now the only thing that queries the stats APIs is Cerebro (https://github.com/lmenezes/cerebro) v0.4.1.
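
For context, these are the kinds of stats calls the monitoring and Cerebro make; a loop like the sketch below (node name and polling interval are illustrative, not our real setup) is one way to hammer the same endpoints and try to reproduce the problem:

# Poll the same stats endpoints that the monitoring and Cerebro hit.
while true; do
  curl -s -o /dev/null 'http://elastic-fk-node01:9200/_nodes/stats'    # nodes stats (TransportNodesStatsAction)
  curl -s -o /dev/null 'http://elastic-fk-node01:9200/_stats'          # indices stats (TransportIndicesStatsAction)
  curl -s -o /dev/null 'http://elastic-fk-node01:9200/_cluster/health'
  sleep 5
done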

So the question is: is this purely a networking issue (we haven't found any network problems so far, but our network engineers are still investigating), or am I missing something? I'd appreciate any tips on where to look next, because everything I've been able to find on this involves mismatched ES or Java versions, which is not the case here.

We've encountered another variation of this issue; this time it starts with a java.lang.IllegalStateException:

[2016-12-14T14:40:04,173][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elastic-fk-node01] failed to execute [indices:monitor/stats] on node [ZwRuJpSMSNCgtMezxDVoPw]
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse]
	at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1278) [elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1250) [elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-5.1.1.jar:5.1.1]
...
Caused by: java.lang.IllegalStateException: No routing state mapped for [103]
	at org.elasticsearch.cluster.routing.ShardRoutingState.fromValue(ShardRoutingState.java:71) ~[elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.cluster.routing.ShardRouting.<init>(ShardRouting.java:254) ~[elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.cluster.routing.ShardRouting.<init>(ShardRouting.java:274) ~[elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.action.admin.indices.stats.ShardStats.readFrom(ShardStats.java:92) ~[elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.action.admin.indices.stats.ShardStats.readShardStats(ShardStats.java:86) ~[elasticsearch-5.1.1.jar:5.1.1]
...

Then the node disconnects:

[2016-12-14T14:40:04,243][INFO ][o.e.c.r.a.AllocationService] [elastic-fk-node01] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{elastic-fk-node03}{ZwRuJpSMSNCgtMezxDVoPw}{4TMlK1W3S2KxWrpSnMENvA}{elastic-fk-node03}{elastic-fk-node03:9300} transport disconnected]).
[2016-12-14T14:40:04,244][INFO ][o.e.c.s.ClusterService   ] [elastic-fk-node01] removed {{elastic-fk-node03}{ZwRuJpSMSNCgtMezxDVoPw}{4TMlK1W3S2KxWrpSnMENvA}{elastic-fk-node03}{elastic-fk-node03:9300},}, reason: zen-disco-node-failed({elastic-fk-node03}{ZwRuJpSMSNCgtMezxDVoPw}{4TMlK1W3S2KxWrpSnMENvA}{elastic-fk-node03}{elastic-fk-node03:9300}), reason(transport disconnected)[{elastic-fk-node03}{ZwRuJpSMSNCgtMezxDVoPw}{4TMlK1W3S2KxWrpSnMENvA}{elastic-fk-node03}{elastic-fk-node03:9300} transport disconnected]

It's not your network; something is broken in the serialization layer. I've seen four independent reports like this, but so far none of them has provided enough information to track down the error. This is the most thorough report I've seen, and it's very helpful. Any additional information you can provide would be most appreciated.

Well, I don't have much right now, but I'm trying different things out.
I've found a host with several scripts using an old version of the Python elasticsearch library, updated it, and ruled the library out as a possible culprit: the disconnections still happen.
Actually, sometimes it looks as if Cerebro is what triggers it. It may well be a coincidence, but I usually have it open somewhere (that's how I typically notice the cluster going yellow). I'll try not accessing it today and see what happens.
I've also enabled DEBUG-level logging for the root logger (roughly as sketched at the end of this post) and will post anything of interest here. Let me know if you need TRACE enabled for some specific component.
Thank you for confirming it's not our network, I'll let our guys know they can stop searching. :slight_smile:
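
For reference, this is roughly how the log levels can be bumped at runtime via the dynamic logger settings in the cluster settings API, assuming I have the setting names right for 5.x (node name is illustrative):

# Raise the root logger to DEBUG at runtime (ES 5.x dynamic logger settings).
curl -s -XPUT 'http://elastic-fk-node01:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"logger._root": "DEBUG"}}'

# And, if needed, TRACE for a specific component such as the transport layer:
curl -s -XPUT 'http://elastic-fk-node01:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"logger.org.elasticsearch.transport": "TRACE"}}'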

Here is the log from a data node that was recovering a shard when it got disconnected from the master: http://pastebin.com/iui9jLLJ. The errors are pretty much the same; DEBUG-level logging doesn't seem to add any interesting details.

I've noticed that the disconnections are becoming less frequent every day:

# grep "exception caught on transport layer" /var/log/elasticsearch/elastic* | grep -Po "(?<=\[)2016\-12\-[0-9]+(?=T)" | sort | uniq -c
      7 2016-12-07
      4 2016-12-08
      1 2016-12-10
      3 2016-12-12
      1 2016-12-13
      2 2016-12-14
      1 2016-12-15

2016-12-07 was the first day after we updated to 5.0.2.

I opened https://github.com/elastic/elasticsearch/issues/22285.

We haven't seen any disconnections since December 15th, so unfortunately I don't have any new information.

Now that I look at the dates, I see that it took exactly seven days for the issue to go away on its own. This could be related to the old indices created in version 2.4 being deleted one by one: we have a nightly cron job (via elasticsearch-curator, sketched below) that deletes everything older than 7 days. The first day after the upgrade (2016-12-07) was the worst, and then the situation slowly went back to normal, perhaps because there were fewer 2.4 indices left each day. Maybe it had something to do with requesting stats from the old indices.
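
For context, the retention job is just a nightly elasticsearch-curator run; a minimal sketch of that kind of setup (curator 4.x action-file format, with illustrative paths and schedule rather than our exact config):

# Illustrative curator 4.x action file: delete indices older than 7 days,
# based on the date embedded in the index name.
cat > /etc/curator/delete_old_indices.yml <<'EOF'
actions:
  1:
    action: delete_indices
    description: "Delete indices older than 7 days"
    options:
      ignore_empty_list: True
    filters:
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 7
EOF

# Nightly cron entry (illustrative), e.g. in /etc/cron.d/elasticsearch-curator:
# 30 1 * * *  root  /usr/bin/curator --config /etc/curator/curator.yml /etc/curator/delete_old_indices.yml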

Yep, just like you said in issue #22285, the old indices were a red herring. I left Cerebro open for a little too long, and here we go: http://pastebin.com/raw/GgcYpFtp.

We found the issue, and a fix has been integrated: https://github.com/elastic/elasticsearch/pull/22317
