Nodes randomly disconnected

We have a cluster running 1.0.0 in Azure using unicast discovery. Recently
we started seeing exceptions like these in the logs:

[2014-04-01 21:40:22,720][DEBUG][action.admin.indices.status] [ES2PROD-M01]
[usg-2014-03-04][4], node[3cCeFKJrTMWaIhE3R6tlZA], [P], s[STARTED]: Failed
to execute [
org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@2c06e675
]
org.elasticsearch.transport.NodeDisconnectedException:
[ES2PROD-D07][inet[/10.0.64.68:9300]][indices/status/s] disconnected

In this case D07 is still up and running. After several dozen of these
exceptions, D07 is disconnected:

[2014-04-01 21:40:24,096][INFO ][cluster.service ] [ES2PROD-M01] removed
{[ES2PROD-D07][3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]{master=false},},
reason:
zen-disco-node_failed([ES2PROD-D07][3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]{master=false}),
reason transport disconnected (with verified connect)

Four seconds later the same node is added back:

[2014-04-01 21:40:28,712][INFO ][cluster.service ] [ES2PROD-M01] added
{[ES2PROD-D07][3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]{master=false},},
reason: zen-disco-receive(join from
node[[ES2PROD-D07][3cCeFKJrTMWaIhE3R6tlZA][es2prod-d07][inet[/10.0.64.68:9300]]{master=false}])

In the meantime the cluster goes yellow and starts recovery. This does not
seem like a timeout issue, since it happens so quickly and the disconnected
node is added right back.

Any ideas on how we can get more info on the root cause and prevent this
from happening?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/df768c5e-5833-42b5-a804-b7d07f51996b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

This could be caused by unreliable network connectivity, or by nodes that
are overloaded and can't respond to other nodes in a timely manner. In a
cloud environment, this tends to happen more often on the lowest-tier
instances. If network connectivity is indeed the cause, you can increase
ping_timeout a bit to accommodate.
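
To narrow down which of these causes applies, it may help to measure how long each node stays out of the cluster and which nodes are affected most often. The following is only a rough sketch in Python, not an Elasticsearch tool: it assumes each cluster.service event sits on a single line in the master's log file (the lines quoted above were wrapped by the mail client), and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Matches the start of a cluster.service "removed"/"added" event as quoted in
# this thread and captures the timestamp, the action, and the FIRST node name
# inside the braces (events listing several nodes are only partly covered).
EVENT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[INFO\s*\]\[cluster\.service\s*\]\s*\[[^\]]+\]\s+"
    r"(?P<action>removed|added)\s+\{\[(?P<node>[^\]]+)\]"
)


def parse_events(lines):
    """Yield (timestamp, action, node) for each removed/added event line."""
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, match.group("action"), match.group("node")


def outage_durations(lines):
    """Return (node, removed_at, seconds_until_rejoin) for each completed outage."""
    pending = {}   # node name -> time it was removed
    outages = []
    for ts, action, node in parse_events(lines):
        if action == "removed":
            pending[node] = ts
        elif node in pending:
            removed_at = pending.pop(node)
            outages.append((node, removed_at, (ts - removed_at).total_seconds()))
    return outages
```

Passing an open log file (any iterable of lines) to outage_durations() gives one entry per removal that was followed by a rejoin. Many very short outages spread across all nodes would point more towards the network, while repeated outages on the same few nodes would suggest looking at load on those machines first.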

Hi Hans,

We were also facing this issue. The cause was spikes in network connectivity between nodes: the master was not able to discover the data nodes during zen discovery, so it removed those nodes and then brought them back. You can try increasing discovery.zen.ping.timeout, which defaults to 3 seconds; this should fix the issue.

All the best..

AryanJ

"Give users what they actually want, not what they say they want. And whatever you do, don’t give them new features just because your competitors have them!!!!!" – Kathy Sierra

Hi AryanJ

What value did you use for ping.timeout? Our cluster is on AWS and we have
been facing this problem for a long time. Every node except the master
leaves the cluster at least once a day.
I currently have it set to 10 seconds. Can I set it to an even bigger value?

Thanks.
