Remote Transport Exception in 5.1.1 w/JDK 1.8.0_111

ben-10 · December 20, 2016, 6:53pm

Noticing that nodes in my cluster will sporadically lose connection to the cluster briefly before re-establishing connection. This is causing a lot of wasted overhead on shard allocation. Searching Google, this appears to have been an issue with earlier versions of Elasticsearch particularly with mismatched versions of Java. I've verified all ES and JVM versions on the nodes in my cluster are the same, but still seeing the error occur. Anyone else solved this?

esearchp04-cn1:~$ curl localhost:9200/_cat/nodes?h=n,v,j
esearchp02-dn1 5.1.1 1.8.0_111
esearchp04-cn1 5.1.1 1.8.0_111
esearchp04-cn2 5.1.1 1.8.0_111
esearchp03-dn1 5.1.1 1.8.0_111
esearchp02-mn1 5.1.1 1.8.0_111
esearchp03-mn1 5.1.1 1.8.0_111
esearchp01-dn1 5.1.1 1.8.0_111
esearchp01-mn1 5.1.1 1.8.0_111

All nodes are configured with the following for their zen discovery options:

discovery.zen.ping.unicast.hosts: [ "IP ADDRESES OF NODES ABOVE" ]
discovery.zen.ping.unicast.hosts.resolve_timeout: 5s
discovery.zen.ping_timeout: 10s #default is 3s
discovery.zen.join_timeout: 60s #default is 20x the ping timeout
discovery.zen.master_election.ignore_non_master_pings: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_interval: 3s #default is 1s
discovery.zen.fd.ping_timeout: 45s #default is 30s
discovery.zen.fd.ping_retries: 3 #default is 3x

Error message from the acting master node log looks like:

org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]
        at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1278) [elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1250) [elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-5.1.1.jar:5.1.1]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1275) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1250) [elasticsearch-5.1.1.jar:5.1.1]

jpountz · December 20, 2016, 7:05pm

There should be another "Caused by" below the message you shared that should help us know more about what happened.

ben-10 · December 20, 2016, 7:15pm

[2016-12-20T03:02:05,406][WARN ][o.e.x.s.t.n.SecurityNetty4Transport] [esearchp02-mn1] exception caught on transport layer [[id: 0x9b24964f, L:/10.20.215.193:54072 - R:10.20.215.222/10.20.215.2
22:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [1913013], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch
.action.support.nodes.TransportNodesAction$AsyncAction$1@44aa70c], error [false]; resetting
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1257) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[transport-netty4-5.1.1.jar:5.1.1]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]

jasontedor · December 20, 2016, 8:22pm

I've seen this several times. Here's another report, and I opened https://github.com/elastic/elasticsearch/issues/22285.

s1monw · December 20, 2016, 8:58pm

can you tell what call causes this? Also does it happen all the time?

ben-10 · December 21, 2016, 3:21pm

It happens pretty frequently, at least a couple times a day. Too large to copy paste, please see: http://pastebin.com/iiJSDjuM

jasontedor · December 21, 2016, 5:20pm

Would you be willing to start your nodes with assertions enabled (add -ea to jvm.options or include -ea in ES_JAVA_OPTS)? Right now we are hunting for a needle in a haystack. I'm hoping that we are just blowing an assertion somewhere and we can get a stack trace that points to the issue.

jasontedor · December 21, 2016, 5:26pm

Also, do you have older indices in your cluster, from version 2.x?

ben-10 · December 21, 2016, 5:32pm

I ran our cluster through the migration checks from 2.x to 5.x though I don't recall having to do anything special besides that. There might be some indices that were closed when I ran that tool, so maybe they're still in 2.x format?

Naive question, but what's the impact of adding those options to the ES_JAVA_OPTS and to confirm that would need to be done on all nodes within the cluster?

jasontedor · December 21, 2016, 5:48pm

In general, it's okay to have older indices in the cluster, they will have just been upgraded the first time you started the cluster with them. We think this issue might be related to indices stats from older indices, that why I am asking if you have such indices or not. If you do not, we can stop thinking that might be the issue. If you do, it's another data point that it is the issue.

As far as enabling assertions, right now you have some stats calls that are blowing up. We don't know why, we think the serialization is borked somewhere. We have some assertions in the code that might help uncover where. They don't run by default, only when we run tests. If you enable them, and the issue is one of the things that we have assertions for, rather than blowing up like you're blowing up right now, you'll blow up differently, but with a stack trace that tells us where the borked serialization is coming from. It will be massively helpful, and won't impact your cluster any differently than how you're being affected right now.

ben-10 · December 21, 2016, 6:57pm

Ok, sounds reasonable to me. I'll put the options on the nodes, restart them, and report back with what happens. Any quick way to test an index to see if it's a 2.x version vs a 5.x version?

jasontedor · December 21, 2016, 7:32pm

We are still working on this, we think the 2.x vs. 5.x thing might be a red herring. What we are looking for at this point is a blow up with a stack trace like this:

org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:218) ~[elasticsearch-6.0.0-alpha1-SNAPSHOT.jar:6.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.index.shard.DocsStats.writeTo(DocsStats.java:69) ~[elasticsearch-6.0.0-alpha1-SNAPSHOT.jar:6.0.0-alpha1-SNAPSHOT]

We think that we found it, but if you can confirm that you see a stack trace like that with assertions enabled that would be helpful.

ben-10 · December 21, 2016, 8:14pm

That could very well be it. I put the -es in the java_opts and restarted the nodes in the cluster. This particular dn node exited with the last message being this:

[2016-12-21T15:07:18,453][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [esearchp03-dn1] fatal error in thread [elasticsearch[esearchp03-dn1][management][T#1]], exiting
java.lang.AssertionError: null
        at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:218) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.index.shard.DocsStats.writeTo(DocsStats.java:77) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.admin.indices.stats.CommonStats.writeTo(CommonStats.java:278) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.admin.indices.stats.ShardStats.writeTo(ShardStats.java:103) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalStreamable(StreamOutput.java:697) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse.writeTo(TransportBroadcastByNodeAction.java:593) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.buildMessage(TcpTransport.java:1093) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1037) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:67) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:61) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:58) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:111) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:426) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:400) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$1(SecurityServerTransportInterceptor.java:208) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:255) ~[?:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1385) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.1.1.jar:5.1.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_111]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_111]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

ben-10 · December 21, 2016, 8:31pm

And from another node that failed in the cluster:

[2016-12-21T15:28:08,244][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [esearchp01-dn1] fatal error in thread [elasticsearch[esearchp01-dn1][management][T#1]], exiting
java.lang.AssertionError: null
        at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:218) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.index.shard.DocsStats.writeTo(DocsStats.java:77) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.admin.indices.stats.CommonStats.writeTo(CommonStats.java:278) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.admin.indices.stats.ShardStats.writeTo(ShardStats.java:103) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalStreamable(StreamOutput.java:697) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse.writeTo(TransportBroadcastByNodeAction.java:593) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.buildMessage(TcpTransport.java:1093) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1037) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:67) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:61) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:58) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:111) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:426) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:400) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$1(SecurityServerTransportInterceptor.java:208) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:255) ~[?:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1385) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.1.1.jar:5.1.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_111]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_111]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

nik9000 · December 21, 2016, 8:35pm

Perfect. We're talking about half of the fix (don't corrupt the stream on negative numbers) here:

And the other half of the fix we're working on (don't get negative numbers in the first place).

jasontedor · December 21, 2016, 8:36pm

Thank you. It's the same issue, we've found it. You can disable assertions. We will get a fix out soon.

ben-10 · December 21, 2016, 8:36pm

Thanks!

jasontedor · December 21, 2016, 9:59pm

Here's the other half of the fix:

jasontedor · December 21, 2016, 10:19pm

@ben-10 Thank you so much for the report and especially enabling assertions to confirm that we found the issue. That is incredibly helpful to know that we really got it. We will get a patch release out soon with the fix, but I do not think it will be until after the holidays. There is no workaround for now other than not invoking the stats requests that can trigger this (it's a race condition, that's why it happens rarely). I realize that this might not be tenable as you might need those stats as part of your monitoring, but that's the best that we can do right now.

ben-10 · December 21, 2016, 10:37pm

You're welcome, glad I actually had something to contribute this time around to help. I think that the doc stat queries are all coming from x-pack.monitoring in the 5.x Kibana. If I'm following, ironically checking my cluster and indices causes the instability in the cluster

Again, thanks for the help and looking forward to the patch after everyone enjoys the holidays!

Topic		Replies	Views
Loss of Connection between Nodes Elasticsearch	5	745	July 6, 2017
500 error from elasticsearch Elasticsearch	3	1512	July 6, 2017
Node stats - Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.stats.NodeStats] Elasticsearch	7	2233	July 6, 2017
NodeStats serialization error Elasticsearch	8	1087	July 6, 2017
Serialization issues on 0.90.3 Elasticsearch	9	506	July 6, 2017

Remote Transport Exception in 5.1.1 w/JDK 1.8.0_111

Related topics