RemoteTransportException in 5.1.1 with JDK 1.8.0_111

I'm noticing that nodes in my cluster sporadically lose their connection to the cluster briefly before re-establishing it, which is causing a lot of wasted overhead on shard allocation. From searching around, this appears to have been an issue with earlier versions of Elasticsearch, particularly with mismatched Java versions. I've verified that all Elasticsearch and JVM versions on the nodes in my cluster are the same, but I'm still seeing the error occur. Has anyone else solved this?

esearchp04-cn1:~$ curl localhost:9200/_cat/nodes?h=n,v,j
esearchp02-dn1 5.1.1 1.8.0_111
esearchp04-cn1 5.1.1 1.8.0_111
esearchp04-cn2 5.1.1 1.8.0_111
esearchp03-dn1 5.1.1 1.8.0_111
esearchp02-mn1 5.1.1 1.8.0_111
esearchp03-mn1 5.1.1 1.8.0_111
esearchp01-dn1 5.1.1 1.8.0_111
esearchp01-mn1 5.1.1 1.8.0_111

All nodes are configured with the following Zen discovery options:

discovery.zen.ping.unicast.hosts: [ "IP ADDRESSES OF NODES ABOVE" ]
discovery.zen.ping.unicast.hosts.resolve_timeout: 5s
discovery.zen.ping_timeout: 10s #default is 3s
discovery.zen.join_timeout: 60s #default is 20x the ping timeout
discovery.zen.master_election.ignore_non_master_pings: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_interval: 3s #default is 1s
discovery.zen.fd.ping_timeout: 45s #default is 30s
discovery.zen.fd.ping_retries: 3 #default is 3x

The error message from the acting master node's log looks like this:

org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]
        at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1278) [elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1250) [elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-5.1.1.jar:5.1.1]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1275) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1250) [elasticsearch-5.1.1.jar:5.1.1]

There should be another "Caused by" below the message you shared that will help us understand what happened.

[2016-12-20T03:02:05,406][WARN ][o.e.x.s.t.n.SecurityNetty4Transport] [esearchp02-mn1] exception caught on transport layer [[id: 0x9b24964f, L:/10.20.215.193:54072 - R:10.20.215.222/10.20.215.222:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [1913013], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@44aa70c], error [false]; resetting
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1257) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[transport-netty4-5.1.1.jar:5.1.1]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]

I've seen this several times. Here's another report, and I opened https://github.com/elastic/elasticsearch/issues/22285.

Can you tell what call causes this? Also, does it happen all the time?

It happens pretty frequently, at least a couple of times a day. The full log is too large to copy and paste; please see: http://pastebin.com/iiJSDjuM

Would you be willing to start your nodes with assertions enabled (add -ea to jvm.options or include -ea in ES_JAVA_OPTS)? Right now we are hunting for a needle in a haystack. I'm hoping that we are simply tripping an assertion somewhere and can get a stack trace that points us to the issue.

Also, do you have older indices in your cluster, from version 2.x?

I ran our cluster through the migration checks from 2.x to 5.x, though I don't recall having to do anything special besides that. There might be some indices that were closed when I ran that tool, so maybe they're still in the 2.x format?

A naive question, but what's the impact of adding those options to ES_JAVA_OPTS? And to confirm, would that need to be done on all nodes in the cluster?

In general, it's okay to have older indices in the cluster; they will simply have been upgraded the first time you started the cluster with them. We think this issue might be related to indices stats from older indices, which is why I'm asking whether you have such indices. If you do not, we can stop suspecting that's the issue; if you do, it's another data point that it is.

As far as enabling assertions goes: right now you have some stats calls that are blowing up. We don't know why; we think the serialization is borked somewhere. We have some assertions in the code that might help uncover where, but they don't run by default, only when we run tests. If you enable them, and the issue is one of the things we have assertions for, then rather than blowing up the way you are right now, you'll blow up differently, but with a stack trace that tells us where the borked serialization is coming from. That would be massively helpful, and it won't impact your cluster any differently from how you're being affected right now.
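For reference, the two ways of enabling assertions mentioned above would look roughly like this. The jvm.options path depends on the install type (typically /etc/elasticsearch/jvm.options for package installs, or config/jvm.options under the Elasticsearch home for archive installs), and either way the flag has to go on every node and each node restarted:

# Option 1: add a line to jvm.options on each node
-ea

# Option 2: set it in the environment used to start Elasticsearch
export ES_JAVA_OPTS="-ea"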

OK, sounds reasonable to me. I'll put the option on the nodes, restart them, and report back with what happens. Is there any quick way to test whether an index is a 2.x version vs. a 5.x version?
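For what it's worth, a quick way to check is the index.version.created setting, which records the Elasticsearch version an index was created on: values beginning with 2 indicate an index created on 2.x, and values beginning with 5 a 5.x-created index. The index name my_index below is just a placeholder:

curl -s 'localhost:9200/my_index/_settings?pretty' | grep '"created"'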

We are still working on this; we think the 2.x vs. 5.x thing might be a red herring. What we are looking for at this point is a failure with a stack trace like this:

    at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:218) ~[elasticsearch-6.0.0-alpha1-SNAPSHOT.jar:6.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.index.shard.DocsStats.writeTo(DocsStats.java:69) ~[elasticsearch-6.0.0-alpha1-SNAPSHOT.jar:6.0.0-alpha1-SNAPSHOT]

We think that we found it, but if you can confirm that you see a stack trace like that with assertions enabled that would be helpful.

That could very well be it. I put -ea in ES_JAVA_OPTS and restarted the nodes in the cluster. This particular data (dn) node exited with this as its last message:

[2016-12-21T15:07:18,453][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [esearchp03-dn1] fatal error in thread [elasticsearch[esearchp03-dn1][management][T#1]], exiting
java.lang.AssertionError: null
        at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:218) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.index.shard.DocsStats.writeTo(DocsStats.java:77) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.admin.indices.stats.CommonStats.writeTo(CommonStats.java:278) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.admin.indices.stats.ShardStats.writeTo(ShardStats.java:103) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalStreamable(StreamOutput.java:697) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse.writeTo(TransportBroadcastByNodeAction.java:593) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.buildMessage(TcpTransport.java:1093) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1037) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:67) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:61) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:58) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:111) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:426) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:400) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$1(SecurityServerTransportInterceptor.java:208) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:255) ~[?:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1385) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.1.1.jar:5.1.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_111]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_111]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

And from another node that failed in the cluster:

[2016-12-21T15:28:08,244][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [esearchp01-dn1] fatal error in thread [elasticsearch[esearchp01-dn1][management][T#1]], exiting
java.lang.AssertionError: null
        at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:218) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.index.shard.DocsStats.writeTo(DocsStats.java:77) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.admin.indices.stats.CommonStats.writeTo(CommonStats.java:278) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.admin.indices.stats.ShardStats.writeTo(ShardStats.java:103) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalStreamable(StreamOutput.java:697) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse.writeTo(TransportBroadcastByNodeAction.java:593) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.buildMessage(TcpTransport.java:1093) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1037) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:67) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:61) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:58) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:111) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:426) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:400) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$1(SecurityServerTransportInterceptor.java:208) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:255) ~[?:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1385) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.1.1.jar:5.1.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_111]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_111]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

Perfect. We're talking about half of the fix (don't corrupt the stream on negative numbers) here:

And we're working on the other half of the fix (don't get negative numbers in the first place).
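To make "corrupt the stream" concrete, here is a small sketch, purely illustrative and not the actual Elasticsearch patch. A vlong is written seven bits per byte with the high bit as a continuation flag, so a negative long (whose top bit is set) emits more bytes than any non-negative value would, the reading side ends up out of step with the writer, and you get the kind of "Message not fully read ... resetting" and failed-deserialization errors shown earlier. Guarding the writer turns that silent corruption into an immediate, diagnosable failure; the class and method names below are made up for the example:

// Illustrative only: a variable-length long writer in the spirit of
// StreamOutput#writeVLong, with a guard that rejects negative values instead of
// silently emitting a byte sequence the reader will not consume the same way.
import java.io.ByteArrayOutputStream;

public class VLongSketch {

    static void writeVLong(ByteArrayOutputStream out, long value) {
        if (value < 0) {
            // "Half of the fix": fail loudly rather than corrupt the stream.
            throw new IllegalStateException("negative longs are not supported: " + value);
        }
        while ((value & ~0x7FL) != 0L) {
            out.write((int) ((value & 0x7FL) | 0x80L)); // low 7 bits plus continuation bit
            value >>>= 7;
        }
        out.write((int) value); // final byte, continuation bit clear
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVLong(out, 123_456_789L);
        System.out.println("encoded length: " + out.size()); // 4 bytes
        try {
            writeVLong(out, -1L); // without the guard this would emit 10 bytes
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}

With a guard (or assertion) like that in place the writer fails on the spot, which is essentially what the AssertionError stack traces above show once -ea is enabled.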

Thank you. It's the same issue; we've found it. You can disable assertions. We will get a fix out soon.

Thanks!

Here's the other half of the fix:

@ben-10 Thank you so much for the report, and especially for enabling assertions to confirm that we found the issue. It is incredibly helpful to know that we really got it. We will get a patch release out soon with the fix, but I do not think it will be until after the holidays. There is no workaround for now other than not invoking the stats requests that can trigger this (it's a race condition, which is why it happens rarely). I realize that this might not be tenable, as you might need those stats as part of your monitoring, but that's the best we can do right now.
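Purely as a sketch of that workaround, and not something suggested in the thread: if the stats requests that trigger the race are coming from X-Pack monitoring collection, monitoring could be switched off on each node until the patched release is installed, at the cost of losing monitoring data in the meantime (this assumes monitoring really is the caller and that the setting below applies to your X-Pack version):

# elasticsearch.yml - temporarily disable X-Pack monitoring (restart required)
xpack.monitoring.enabled: false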

You're welcome; glad I actually had something to contribute this time around. I think the doc-stats queries are all coming from X-Pack monitoring in 5.x Kibana. If I'm following correctly, then ironically, checking on my cluster and indices is what causes the instability in the cluster :confused:

Again, thanks for the help, and I'm looking forward to the patch once everyone has enjoyed the holidays!