Nodes keep disconnecting from cluster at random

Paul_Zaltsman · November 29, 2017, 12:22pm

Hi, I have an issue with a 5.5.2 ES cluster.
We see a lot of disconnects between our nodes.
Even a simple _nodes query causes random disconnections. In this example, only the node I queried answered; the rest failed.

We have 10 data nodes and 3 masters in the cluster: 40 indices with 300 shards, 2TB in total, all in the same Google datacenter.
The OS is CentOS 7.3.1611 (Core).
There is no network congestion or any other unusual behavior, and yet this is the result of a GET _nodes query.
At this point some nodes disconnect and the cluster becomes yellow with unassigned shards.
All servers run 5.5.2 on JVM 1.8.0_151.

id v
4tgD 5.5.2
9UW6 5.5.2
8scZ 5.5.2
rFIc 5.5.2
XhZN 5.5.2
vNyd 5.5.2
_cmX 5.5.2
UI-s 5.5.2
eQxy 5.5.2
hgyP 5.5.2
xz8x 5.5.2
1bZd 5.5.2
cyJg 5.5.2

Paul_Zaltsman · November 29, 2017, 12:24pm

The forum didn't let me paste the whole node info.
https://pastebin.com/xLP99vXM

Paul_Zaltsman · November 29, 2017, 12:27pm

Stack trace from one of the nodes while running this:
https://pastebin.com/wXs5SgNp

spinscale · December 1, 2017, 8:42am

Hey,

That's an odd one. The exception means that a serialized data stream from another node could not be read. This should not happen, especially when you have the same Elasticsearch version everywhere.

Can you run the following and post the output?

GET _cat/master
GET _cat/nodes?v&h=id,name,version,jdk,node.role,master
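
For readers following along, here is a minimal Python sketch of what the second request is meant to reveal: whether every node reports the same Elasticsearch and JDK version. It assumes the plain-text output of that _cat/nodes request has been saved to a string. The function name `version_groups` is hypothetical and not part of Elasticsearch, and the sample rows are only there to illustrate the expected shape.

```python
from collections import defaultdict


def version_groups(cat_nodes_text):
    """Group node names by (Elasticsearch version, JDK version).

    Expects the whitespace-separated text returned by
    GET _cat/nodes?v&h=id,name,version,jdk,node.role,master
    including the header row produced by the `v` flag.
    """
    lines = [ln for ln in cat_nodes_text.splitlines() if ln.strip()]
    header = lines[0].split()
    groups = defaultdict(list)
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        groups[(row["version"], row["jdk"])].append(row["name"])
    return dict(groups)


# Illustrative rows in the shape of the _cat/nodes output.
sample = """\
id   name          version jdk       node.role master
SG00 prod-es-dn-6a 5.5.2   1.8.0_151 di        -
8scZ prod-es-ma-1  5.5.2   1.8.0_151 m         *
vNyd prod-es-ma-3  5.5.2   1.8.0_151 m         -
"""

groups = version_groups(sample)
# A single key means all nodes agree on version and JDK;
# more than one key would point to a mixed-version cluster.
print(len(groups), groups)
```

If the result has more than one key, the cluster is running mixed versions, which is the first thing worth ruling out for a deserialization failure.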

Paul_Zaltsman · December 4, 2017, 10:20am

Hi. Yes, I'm aware this shouldn't happen, but it still does.
You may notice the cluster is bigger than I initially said; we are replacing some of the machines, so you can ignore that.

_cat/master

8scZVDYGTvaHA9B4EfCYCA 10.241.0.61 10.241.0.61 prod-es-ma-1

_cat/nodes?v&h=id,name,version,jdk,node.role,master
id name version jdk node.role master
SG00 prod-es-dn-6a 5.5.2 1.8.0_151 di -
4tgD prod-es-dn-3 5.5.2 1.8.0_151 di -
rFIc prod-es-dn-5 5.5.2 1.8.0_151 di -
eoiz prod-es-dn-1a 5.5.2 1.8.0_151 di -
xz8x prod-es-dn-6 5.5.2 1.8.0_151 di -
XhZN prod-es-dn-10 5.5.2 1.8.0_151 di -
eQxy prod-es-dn-8 5.5.2 1.8.0_151 di -
8p_p prod-es-dn-10a 5.5.2 1.8.0_151 di -
9UW6 prod-es-dn-9 5.5.2 1.8.0_151 di -
xHCC prod-es-dn-11a 5.5.2 1.8.0_151 di -
8scZ prod-es-ma-1 5.5.2 1.8.0_151 m *
vNyd prod-es-ma-3 5.5.2 1.8.0_151 m -
w6qd prod-es-dn-5a 5.5.2 1.8.0_151 di -
QnKW prod-es-dn-2a 5.5.2 1.8.0_151 di -
1bZd prod-es-dn-2 5.5.2 1.8.0_151 di -
_cmX prod-es-dn-7 5.5.2 1.8.0_151 di -
aAA prod-es-dn-12a 5.5.2 1.8.0_151 di -
XYSt prod-es-dn-4a 5.5.2 1.8.0_151 di -
Bcf7 prod-es-dn-3a 5.5.2 1.8.0_151 di -
LBBL prod-es-dn-8a 5.5.2 1.8.0_151 di -
hJ11 prod-es-dn-9a 5.5.2 1.8.0_151 di -
BquO prod-es-dn-7a 5.5.2 1.8.0_151 di -
UI-s prod-es-dn-4 5.5.2 1.8.0_151 di -
cyJg prod-es-dn-1 5.5.2 1.8.0_151 di -
hgyP prod-es-ma-2 5.5.2 1.8.0_151 m -

Christian_Dahlqvist · December 4, 2017, 10:52am

Is there anything in the logs on the nodes that are disconnecting, e.g. long GC pauses?

Paul_Zaltsman · December 4, 2017, 11:15am

There is some GC on some of the machines, but it does not line up with the nodes that actually disconnect.
You can see this in the log I provided in my original post.

Paul_Zaltsman · December 7, 2017, 11:19am

Bump. Any idea which direction to dig in for this issue?
Our cluster still has these errors:

[DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [prod-es-dn-7a] failed to execute on node [SG00OvqdRIaUTz_wUlHMFg]
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.info.NodeInfo]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.info.NodeInfo]..

[o.e.t.n.Netty4Transport ] [prod-es-dn-7a] exception caught on transport layer [[id: 0x8b42e333, L:/10.241.0.76:41400 - R:10.241.0.75/10.241.0.75:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [70056], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@5caf60c0], error [false]; resetting
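
One possible direction to dig in, sketched in Python under stated assumptions: tally which (reporting node, target node) pairs appear in the "failed to execute on node" lines. This shows whether the failures cluster on particular nodes, for example the replacement machines mentioned earlier. The regex only targets the DEBUG line shape quoted above, and the function name `count_failed_pairs` is hypothetical.

```python
import re
from collections import Counter

# Matches the DEBUG line shape quoted above:
#   [reporting-node] failed to execute on node [target-node-id]
FAILED_ON_NODE = re.compile(
    r"\[([^\]]+)\] failed to execute on node \[([^\]]+)\]"
)


def count_failed_pairs(log_lines):
    """Count (reporting node, target node id) pairs across log lines."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_ON_NODE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


log_lines = [
    "[DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [prod-es-dn-7a] "
    "failed to execute on node [SG00OvqdRIaUTz_wUlHMFg]",
    "org.elasticsearch.transport.RemoteTransportException: "
    "[Failed to deserialize response of type "
    "[org.elasticsearch.action.admin.cluster.node.info.NodeInfo]]",
]

counts = count_failed_pairs(log_lines)
for (reporter, target), n in counts.most_common():
    print(reporter, "->", target, n)
```

Running this over the logs from every node would show whether the same few targets account for most failures.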

system · January 4, 2018, 11:20am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
