Nodes randomly disconnected

Hans_Krijger · April 2, 2014, 4:40am

We have a cluster running 1.0.0 in Azure using unicast discovery. Recently
we started seeing exceptions like these in the logs:

[2014-04-01 21:40:22,720][DEBUG][action.admin.indices.status] [ES2PROD-M01]
[usg-2014-03-04][4], node[3cCeFKJrTMWaIhE3R6tlZA], [P], s[STARTED]: Failed
to execute [
org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@2c06e675
]
org.elasticsearch.transport.NodeDisconnectedException:
[ES2PROD-D07][inet[/10.0.64.68:9300]][indices/status/s] disconnected

In this case D07 is still up and running. After several dozen of these
exceptions, D07 is disconnected:

[2014-04-01 21:40:24,096][INFO ][cluster.service ] [ES2PROD-M01] removed
{[ES2PROD-D07][3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]{master=false},},
reason:
zen-disco-node_failed([ES2PROD-D07][3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]{master=false}),
reason transport disconnected (with verified connect)

Four seconds later the same node is added back:

[2014-04-01 21:40:28,712][INFO ][cluster.service ] [ES2PROD-M01] added
{[ES2PROD-D07][3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]{master=false},},
reason: zen-disco-receive(join from
node[[ES2PROD-D07][3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]{master=false}])

In the meantime the cluster goes yellow and starts recovery. This does not
seem like a timeout type of issue, since it happens so quickly and the
disconnected node is added right back.

Any ideas on how we can get more info on the root cause and prevent this
from happening?
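
One way to get more data on how often this happens and how long each node stays out is to pair up the "removed" and "added" events in the master's log. The following is only a rough sketch in Python, not an official tool. It assumes each log entry sits on a single line (the entries above were wrapped by the mail client), and the names EVENT_RE and rejoin_gaps are made up for this example.

```python
import re
from datetime import datetime

# Matches the cluster.service "removed"/"added" entries quoted above.
# Assumption: each entry is on one physical line in the real log file.
EVENT_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[INFO\s*\]\[cluster\.service\s*\]\s+\[[^\]]+\]\s+"
    r"(?P<event>removed|added)\s+\{\[(?P<node>[^\]]+)\]"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def rejoin_gaps(lines):
    """Return (node_name, seconds_out_of_cluster) for each removed/added pair."""
    removed_at = {}  # node name -> time of its most recent removal
    gaps = []
    for line in lines:
        match = EVENT_RE.match(line)
        if match is None:
            continue  # not a node removed/added entry
        ts = datetime.strptime(match.group("ts"), TS_FORMAT)
        node = match.group("node")
        if match.group("event") == "removed":
            removed_at[node] = ts
        elif node in removed_at:
            gaps.append((node, (ts - removed_at.pop(node)).total_seconds()))
    return gaps


# The two entries from this post, re-joined onto one line each.
SAMPLE = [
    "[2014-04-01 21:40:24,096][INFO ][cluster.service ] [ES2PROD-M01] removed "
    "{[ES2PROD-D07][3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]"
    "{master=false},}, reason: zen-disco-node_failed([ES2PROD-D07]"
    "[3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]{master=false}), "
    "reason transport disconnected (with verified connect)",
    "[2014-04-01 21:40:28,712][INFO ][cluster.service ] [ES2PROD-M01] added "
    "{[ES2PROD-D07][3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]"
    "{master=false},}, reason: zen-disco-receive(join from node[[ES2PROD-D07]"
    "[3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]{master=false}])",
]

for node, seconds in rejoin_gaps(SAMPLE):
    print(f"{node} rejoined after {seconds:.3f}s")  # ES2PROD-D07 rejoined after 4.616s
```

Run over a full day of master logs, the per-node gaps and their timestamps could then be compared against network or load metrics for the same moments.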

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/df768c5e-5833-42b5-a804-b7d07f51996b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Binh_Ly_2 · April 4, 2014, 12:52pm

This could be caused by unreliable network connectivity, or by nodes that
are overloaded and can't respond to other nodes in a timely manner. In a
cloud environment, this can happen more often on the lowest tier
instances. If network connectivity is indeed the cause, you can increase
ping_timeout a bit to accommodate it.


BornGenius · April 4, 2014, 1:15pm

Hi Hans,

We were also facing this issue. The cause was intermittent drops in network connectivity between nodes, which left the master unable to discover the data nodes during zen discovery, so it removed those nodes and then brought them back. You can try increasing discovery.zen.ping.timeout, which defaults to 3 seconds; this should fix the issue.

All the best..

AryanJ

"Give users what they actually want, not what they say they want. And whatever you do, don’t give them new features just because your competitors have them!!!!!" – Kathy Sierra

Anil_Karaka · April 1, 2015, 6:09am

Hi AryanJ

What value did you use for ping.timeout? Our cluster is on AWS and we have
been facing this problem for a long time. Every node except the master
leaves the cluster at least once a day.
I currently have it set to 10 seconds. Can I set it to an even bigger value?

Thanks.

