Hi, I have an issue with a 5.5.2 ES cluster.
We see a lot of disconnects between our nodes.
Even sending a simple _nodes query causes random disconnections. In this example, only the node I queried answered; the rest failed.
We have 10 data nodes and 3 masters in the cluster: 40 indices with 300 shards, 2 TB in total, all in the same Google datacenter.
The OS is CentOS 7.3.1611 (Core).
There is no network congestion or any unusual behavior, and yet this is the result of a GET _nodes query.
At this point some nodes disconnect and the cluster becomes yellow with unassigned shards.
All servers run 5.5.2 on JVM 1.8.0_151.
That's an odd one. The exception means that a serialized data stream could not be read from another node. This should not happen, especially when every node runs the same Elasticsearch version.
Can you run the following?
GET _cat/master
GET _cat/nodes?v&h=id,name,version,jdk,node.role,master
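If it helps to check the output for mismatches rather than eyeballing it, here is a minimal sketch. It assumes the second request is sent with `format=json` instead of `v`, so that each node comes back as a JSON object keyed by the requested column names. The function name `find_mismatches` and the sample node names are hypothetical and only there for illustration.

```python
import json
from collections import defaultdict


def find_mismatches(cat_nodes_json, fields=("version", "jdk")):
    """Return the fields whose value differs between nodes.

    cat_nodes_json is assumed to be the body returned by
    GET _cat/nodes?format=json&h=id,name,version,jdk,node.role,master
    The result maps each differing field to {value: [node names]}.
    """
    rows = json.loads(cat_nodes_json)
    mismatches = {}
    for field in fields:
        groups = defaultdict(list)
        for row in rows:
            groups[row.get(field)].append(row.get("name"))
        # More than one distinct value means the nodes disagree on this field.
        if len(groups) > 1:
            mismatches[field] = dict(groups)
    return mismatches


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the cluster in this thread.
    sample = json.dumps([
        {"id": "aaaa", "name": "node-a", "version": "5.5.2", "jdk": "1.8.0_151"},
        {"id": "bbbb", "name": "node-b", "version": "5.5.2", "jdk": "1.8.0_144"},
    ])
    print(find_mismatches(sample))
```

An empty result means every node reports the same Elasticsearch version and JDK, which would rule out the most common cause of deserialization failures between nodes.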
Hi. Yes, I'm aware this shouldn't happen, but it still does.
You may notice the cluster is bigger than I initially said; we are replacing some of the machines, so you can ignore that.
There is some GC activity on some of the machines, but it does not line up with the nodes that actually disconnect.
You can see this in the log I provided in my original post.
Bump. Any idea which direction to dig in for this issue?
Our cluster still has those errors...
[DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [prod-es-dn-7a] failed to execute on node [SG00OvqdRIaUTz_wUlHMFg]
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.info.NodeInfo]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.info.NodeInfo]..
[o.e.t.n.Netty4Transport ] [prod-es-dn-7a] exception caught on transport layer [[id: 0x8b42e333, L:/10.241.0.76:41400 - R:10.241.0.75/10.241.0.75:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [70056], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@5caf60c0], error [false]; resetting
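To see whether the failures cluster around particular peers (for example the machines being replaced, or the ones with GC pauses), one option is to tally the remote address in each "exception caught on transport layer" entry. This is only a rough sketch: it assumes each log entry sits on a single line in the format shown above, and `count_remote_peers` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches e.g. "[[id: 0x8b42e333, L:/10.241.0.76:41400 - R:10.241.0.75/10.241.0.75:9300]]"
# and captures the local address and the remote host (the part before the "/").
TRANSPORT_ERR = re.compile(
    r"exception caught on transport layer \[\[id: [^,]+, "
    r"L:/(?P<local>[\d.]+):\d+ - R:(?P<remote>[^/\]]+)/"
)


def count_remote_peers(log_lines):
    """Count transport-layer exceptions per remote peer address."""
    counts = Counter()
    for line in log_lines:
        match = TRANSPORT_ERR.search(line)
        if match:
            counts[match.group("remote")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_peers.py < elasticsearch.log
    for peer, n in count_remote_peers(sys.stdin).most_common():
        print(f"{peer}\t{n}")
```

If one or two addresses dominate the tally, those nodes are the first place to look; an even spread would point more towards something cluster-wide.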