Nodes keep disconnecting from cluster at random

Hi, I have an issue with a 5.5.2 ES cluster.
We see a lot of disconnects between our nodes.
Even sending a simple _nodes query causes random disconnections. In this example, only the node I queried answered; the rest failed.

We have 10 data nodes and 3 master nodes in the cluster, 40 indices with 300 shards, roughly 2 TB in total, all in the same Google datacenter.
The OS is CentOS 7.3.1611 (Core).
There is no network congestion or any other unusual behavior, and yet this is the result of a GET _nodes query.
At this point some nodes disconnect and the cluster turns yellow with unassigned shards.
All servers run 5.5.2 on JVM 1.8.0_151.

id v
4tgD 5.5.2
9UW6 5.5.2
8scZ 5.5.2
rFIc 5.5.2
XhZN 5.5.2
vNyd 5.5.2
_cmX 5.5.2
UI-s 5.5.2
eQxy 5.5.2
hgyP 5.5.2
xz8x 5.5.2
1bZd 5.5.2
cyJg 5.5.2

The forum didn't let me paste the whole node info, so here it is:
https://pastebin.com/xLP99vXM

Stack trace from one of the nodes while running this:
https://pastebin.com/wXs5SgNp

Hey,

That's an odd one. The exception means that a serialized data stream could not be read from another node. This should not happen, especially not when you are running the same Elasticsearch version everywhere.

Can you run the following and post the output?

GET _cat/master
GET _cat/nodes?v&h=id,name,version,jdk,node.role,master
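
If it is easier from a shell, the same two checks via curl look roughly like this (a minimal sketch, assuming the default HTTP port 9200 on localhost and no authentication in front of the cluster):

# which node is currently the elected master
curl -s 'http://localhost:9200/_cat/master?v'
# id, name, version, JDK, roles and master marker for every node
curl -s 'http://localhost:9200/_cat/nodes?v&h=id,name,version,jdk,node.role,master'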

Hi. Yes, I'm aware this shouldn't happen, but it still does.
You may notice the cluster is bigger than I initially said; we are replacing some of the machines, so you can ignore that.

_cat/master

8scZVDYGTvaHA9B4EfCYCA 10.241.0.61 10.241.0.61 prod-es-ma-1

_cat/nodes?v&h=id,name,version,jdk,node.role,master
id name version jdk node.role master
SG00 prod-es-dn-6a 5.5.2 1.8.0_151 di -
4tgD prod-es-dn-3 5.5.2 1.8.0_151 di -
rFIc prod-es-dn-5 5.5.2 1.8.0_151 di -
eoiz prod-es-dn-1a 5.5.2 1.8.0_151 di -
xz8x prod-es-dn-6 5.5.2 1.8.0_151 di -
XhZN prod-es-dn-10 5.5.2 1.8.0_151 di -
eQxy prod-es-dn-8 5.5.2 1.8.0_151 di -
8p_p prod-es-dn-10a 5.5.2 1.8.0_151 di -
9UW6 prod-es-dn-9 5.5.2 1.8.0_151 di -
xHCC prod-es-dn-11a 5.5.2 1.8.0_151 di -
8scZ prod-es-ma-1 5.5.2 1.8.0_151 m *
vNyd prod-es-ma-3 5.5.2 1.8.0_151 m -
w6qd prod-es-dn-5a 5.5.2 1.8.0_151 di -
QnKW prod-es-dn-2a 5.5.2 1.8.0_151 di -
1bZd prod-es-dn-2 5.5.2 1.8.0_151 di -
_cmX prod-es-dn-7 5.5.2 1.8.0_151 di -
aAA prod-es-dn-12a 5.5.2 1.8.0_151 di -
XYSt prod-es-dn-4a 5.5.2 1.8.0_151 di -
Bcf7 prod-es-dn-3a 5.5.2 1.8.0_151 di -
LBBL prod-es-dn-8a 5.5.2 1.8.0_151 di -
hJ11 prod-es-dn-9a 5.5.2 1.8.0_151 di -
BquO prod-es-dn-7a 5.5.2 1.8.0_151 di -
UI-s prod-es-dn-4 5.5.2 1.8.0_151 di -
cyJg prod-es-dn-1 5.5.2 1.8.0_151 di -
hgyP prod-es-ma-2 5.5.2 1.8.0_151 m -

Is there anything in the logs on the nodes that are disconnecting, e.g. long GC?

There is some GC activity on some of the machines, but it doesn't correlate with the nodes that actually disconnect.
You can see it in the log I provided in my original post.
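
For reference, this is roughly how the GC hits can be spotted per node (a quick sketch; the log path is the default RPM/DEB location and is an assumption):

# JvmGcMonitorService logs both long collection durations and "overhead" warnings
grep -h 'JvmGcMonitorService' /var/log/elasticsearch/*.log | tail -n 20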

Bump... Any idea in which direction to dig into this issue?
Our cluster still shows these errors:

[DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [prod-es-dn-7a] failed to execute on node [SG00OvqdRIaUTz_wUlHMFg]
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.info.NodeInfo]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.info.NodeInfo]..

[o.e.t.n.Netty4Transport] [prod-es-dn-7a] exception caught on transport layer [[id: 0x8b42e333, L:/10.241.0.76:41400 - R:10.241.0.75/10.241.0.75:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [70056], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@5caf60c0], error [false]; resetting
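
One possible direction, as a rough sketch only: temporarily raise transport logging on the two nodes involved via the dynamic cluster settings API, reproduce the GET _nodes call, and look at what goes over the wire for the failing requestId. This is very verbose, so set the logger back to null afterwards. Assuming the default HTTP port and no security layer in front:

# enable TRACE logging for the transport layer (dynamic setting, no restart needed)
curl -s -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.transport": "TRACE"
  }
}'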

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.