I'm noticing that nodes in my cluster will sporadically lose their connection to the cluster briefly before re-establishing it. This is causing a lot of wasted overhead on shard allocation. From searching Google, this appears to have been an issue with earlier versions of Elasticsearch, particularly with mismatched Java versions. I've verified that the ES and JVM versions are the same on every node in my cluster, but I'm still seeing the error occur. Has anyone else solved this?
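For reference, this is roughly how I checked the versions on each node (a quick sketch using the cat nodes API; I believe name, version, and jdk are valid column names in 5.x):
GET _cat/nodes?v&h=name,version,jdk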
All nodes are configured with the following for their zen discovery options:
discovery.zen.ping.unicast.hosts: [ "IP ADDRESSES OF NODES ABOVE" ]
discovery.zen.ping.unicast.hosts.resolve_timeout: 5s
discovery.zen.ping_timeout: 10s #default is 3s
discovery.zen.join_timeout: 60s #default is 20x the ping timeout
discovery.zen.master_election.ignore_non_master_pings: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_interval: 3s #default is 1s
discovery.zen.fd.ping_timeout: 45s #default is 30s
discovery.zen.fd.ping_retries: 3 #default is 3
The error message from the acting master node's log looks like this:
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse]
at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1278) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1250) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-5.1.1.jar:5.1.1]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1275) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1250) [elasticsearch-5.1.1.jar:5.1.1]
[2016-12-20T03:02:05,406][WARN ][o.e.x.s.t.n.SecurityNetty4Transport] [esearchp02-mn1] exception caught on transport layer [[id: 0x9b24964f, L:/10.20.215.193:54072 - R:10.20.215.222/10.20.215.222:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [1913013], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@44aa70c], error [false]; resetting
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1257) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[transport-netty4-5.1.1.jar:5.1.1]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
Would you be willing to start your nodes with assertions enabled (add -ea to jvm.options or include -ea in ES_JAVA_OPTS)? Right now we are hunting for a needle in a haystack. I'm hoping that we are just blowing an assertion somewhere and we can get a stack trace that points to the issue.
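For example (the exact paths depend on how the nodes were installed; the package installs usually keep jvm.options under /etc/elasticsearch), either of these should do it:
# In config/jvm.options, add the flag on its own line to turn Java assertions on:
-ea
# Or, for a tarball install, set it in the environment when starting the node:
ES_JAVA_OPTS="-ea" ./bin/elasticsearch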
I ran our cluster through the migration checks from 2.x to 5.x, though I don't recall having to do anything special beyond that. There might be some indices that were closed when I ran that tool, so maybe they're still in the 2.x format?
Naive question, but what's the impact of adding those options to ES_JAVA_OPTS? And just to confirm, would that need to be done on all nodes in the cluster?
In general, it's okay to have older indices in the cluster; they will have just been upgraded the first time you started the cluster with them. We think this issue might be related to indices stats from older indices, which is why I am asking whether you have such indices or not. If you do not, we can stop thinking that might be the issue. If you do, it's another data point that it is the issue.
As far as enabling assertions: right now you have some stats calls that are blowing up. We don't know why; we think the serialization is borked somewhere. We have some assertions in the code that might help uncover where, but they don't run by default, only when we run tests. If you enable them, and the issue is one of the things we have assertions for, then instead of blowing up the way you are now, you'll blow up differently, with a stack trace that tells us where the borked serialization is coming from. It would be massively helpful, and it won't impact your cluster any differently than you're already being affected.
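To illustrate, here is a simplified sketch (not the real Elasticsearch source) of what one of those serialization assertions looks like and why -ea changes the failure mode:
// Simplified sketch: the real checks live in classes like StreamOutput, but the
// mechanism is the same. With -ea, a bad value fails loudly on the sending node,
// at the exact frame that wrote it, instead of producing a corrupt message that
// only blows up when the receiving node tries to deserialize it.
public class AssertionSketch {
    static void writeVLong(long value) {
        assert value >= 0 : "vlong must be non-negative, got " + value;
        // ... variable-length encoding would follow here ...
    }

    public static void main(String[] args) {
        writeVLong(-1); // hypothetical bad stats value; throws AssertionError only under -ea
    }
}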
Ok, sounds reasonable to me. I'll put the options on the nodes, restart them, and report back with what happens. Is there a quick way to test whether an index is a 2.x version vs. a 5.x version?
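(I'm guessing I could look at the index.version.created setting for that, with something like the request below, where your-index is just a placeholder, and check the created version it reports. Correct me if there's a better way.)
GET /your-index/_settings?human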
We are still working on this; we think the 2.x vs. 5.x thing might be a red herring. What we are looking for at this point is a blow-up with a stack trace like this:
org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:218) ~[elasticsearch-6.0.0-alpha1-SNAPSHOT.jar:6.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.index.shard.DocsStats.writeTo(DocsStats.java:69) ~[elasticsearch-6.0.0-alpha1-SNAPSHOT.jar:6.0.0-alpha1-SNAPSHOT]
We think that we found it, but if you can confirm that you see a stack trace like that with assertions enabled that would be helpful.
That could very well be it. I put the -ea flag in ES_JAVA_OPTS and restarted the nodes in the cluster. This particular data node exited with the following as its last log message:
[2016-12-21T15:07:18,453][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [esearchp03-dn1] fatal error in thread [elasticsearch[esearchp03-dn1][management][T#1]], exiting
java.lang.AssertionError: null
at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:218) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.index.shard.DocsStats.writeTo(DocsStats.java:77) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.admin.indices.stats.CommonStats.writeTo(CommonStats.java:278) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.admin.indices.stats.ShardStats.writeTo(ShardStats.java:103) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalStreamable(StreamOutput.java:697) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse.writeTo(TransportBroadcastByNodeAction.java:593) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport.buildMessage(TcpTransport.java:1093) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1037) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:67) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:61) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:58) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:111) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:426) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:400) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$1(SecurityServerTransportInterceptor.java:208) ~[?:?]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:255) ~[?:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1385) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
[2016-12-21T15:28:08,244][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [esearchp01-dn1] fatal error in thread [elasticsearch[esearchp01-dn1][management][T#1]], exiting
java.lang.AssertionError: null
at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:218) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.index.shard.DocsStats.writeTo(DocsStats.java:77) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.admin.indices.stats.CommonStats.writeTo(CommonStats.java:278) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.admin.indices.stats.ShardStats.writeTo(ShardStats.java:103) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalStreamable(StreamOutput.java:697) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse.writeTo(TransportBroadcastByNodeAction.java:593) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport.buildMessage(TcpTransport.java:1093) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1037) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:67) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:61) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:58) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:111) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:426) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:400) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$1(SecurityServerTransportInterceptor.java:208) ~[?:?]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:255) ~[?:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1385) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
@ben-10 Thank you so much for the report, and especially for enabling assertions to confirm that we found the issue. It is incredibly helpful to know that we really got it. We will get a patch release out with the fix soon, but I do not think it will be until after the holidays. There is no workaround for now other than not invoking the stats requests that can trigger this (it's a race condition, which is why it happens rarely). I realize this might not be tenable, as you might need those stats as part of your monitoring, but that's the best we can do right now.
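If it turns out those stats requests are coming from X-Pack monitoring, one way to stop invoking them temporarily (a suggestion, not something we have verified in your environment) would be to disable monitoring on the nodes and restart them until the patched release is out, e.g. in elasticsearch.yml:
xpack.monitoring.enabled: false
That trades away the monitoring data in the meantime, so it may or may not be acceptable for you.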
You're welcome; glad I actually had something to contribute this time around. I think the doc stats queries are all coming from x-pack.monitoring in the 5.x Kibana. If I'm following correctly, then, ironically, checking on my cluster and indices is what causes the instability in the cluster.
Again, thanks for the help, and I'm looking forward to the patch after everyone enjoys the holidays!