Random node disconnects in Azure, no resource issues as near as I can tell


(Eric Brandes) #1

I have a 3-node cluster running ES 1.0.1 in Azure. They're Windows VMs
with 7GB of RAM. The JVM heap size is set to 4GB per node. There is
a single index in the cluster with 50 shards and 1 replica. The total
number of documents on primary shards is 29 million with a store size of
60GB (including replicas).
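
As a rough sanity check on these figures, the per-node load can be worked out with a few lines of Python. This is only a sketch: the numbers come from the post, but the even spread of shards across nodes is an assumption, and the function name `cluster_layout` is made up for illustration.

```python
# Figures taken from the post; an even shard distribution is an assumption.
NODES = 3
PRIMARY_SHARDS = 50
REPLICAS_PER_PRIMARY = 1
STORE_SIZE_GB = 60.0       # includes replicas
RAM_PER_NODE_GB = 7.0
HEAP_PER_NODE_GB = 4.0


def cluster_layout(nodes, primaries, replicas, store_gb, ram_gb, heap_gb):
    """Return per-node shard and storage figures, assuming an even spread."""
    shard_copies = primaries * (1 + replicas)
    return {
        "shard_copies": shard_copies,
        "shard_copies_per_node": shard_copies / nodes,
        "gb_per_shard_copy": store_gb / shard_copies,
        "store_gb_per_node": store_gb / nodes,
        "ram_outside_heap_gb": ram_gb - heap_gb,
    }


layout = cluster_layout(NODES, PRIMARY_SHARDS, REPLICAS_PER_PRIMARY,
                        STORE_SIZE_GB, RAM_PER_NODE_GB, HEAP_PER_NODE_GB)
for name, value in layout.items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions each node holds roughly 33 shard copies and about 20GB of index data, with about 3GB of RAM left outside the heap for the OS and file system cache. That does not point to a cause by itself, but it gives a baseline to compare monitoring data against.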

Almost every day now I get a random node disconnecting from the cluster.
The usual suspect is a ping timeout. The longest GC in the logs is about 1
sec, and the boxes don't look resource constrained at all. CPU never
goes above 20%. The used JVM heap size never goes above 6GB (out of 12GB
total across the cluster) and the field data cache never gets over 1GB. The
node that drops out is different every day. I have
discovery.zen.minimum_master_nodes set so there's not any kind of split brain
scenario, but there are times where the disconnected node NEVER rejoins
until I bounce the process.

Has anyone seen this before? Is it an Azure networking issue? How can I
tell? If it's resource problems, what's the best way for me to turn on
logging to diagnose them? What else can I tell you or what other steps can
I take to figure this out? It's really quite maddening :frowning:

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8f85c254-9d53-4507-a340-4c8f2a4a078d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(David Pilato) #2

Just checking: are you using the Azure cloud plugin or a unicast list of nodes?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(Eric Brandes) #3

I'm using the unicast list of nodes at the moment. I have multicast turned
off as well. I have not changed the default ping timeout or anything.
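
For context on what "default ping timeout" means in practice, here is a small sketch of the arithmetic. The values are assumptions: as far as I recall, ES 1.x zen fault detection defaults to `discovery.zen.fd.ping_timeout: 30s` and `discovery.zen.fd.ping_retries: 3`, but please verify them against the documentation for your version. The helper names are hypothetical.

```python
# Assumed ES 1.x zen fault-detection defaults; verify for your version.
ASSUMED_PING_TIMEOUT_S = 30.0
ASSUMED_PING_RETRIES = 3


def detection_window_seconds(ping_timeout_s=ASSUMED_PING_TIMEOUT_S,
                             ping_retries=ASSUMED_PING_RETRIES):
    """Approximate time a node must stay unresponsive before it is dropped."""
    return ping_timeout_s * ping_retries


def pause_exceeds_window(pause_s,
                         ping_timeout_s=ASSUMED_PING_TIMEOUT_S,
                         ping_retries=ASSUMED_PING_RETRIES):
    """True if a single pause is long enough to fail every retry."""
    return pause_s >= detection_window_seconds(ping_timeout_s, ping_retries)


longest_gc_s = 1.0  # longest GC pause reported in the original post
print(detection_window_seconds())          # 90.0
print(pause_exceeds_window(longest_gc_s))  # False
```

If those defaults hold, a 1 second GC pause is far too short to explain a ping timeout on its own, which is consistent with the suspicion that something other than GC is involved.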

On Thursday, May 29, 2014 7:37:38 PM UTC-5, David Pilato wrote:

Just checking: are you using the Azure cloud plugin or a unicast list of nodes?



(Michael Delaney) #4

Are you using internal fully qualified domain names, e.g. es01.myelasticsearcservice.f3.internal.net?
If you use public load balancer endpoints you'll get timeouts.



(Eric Brandes) #5

The three nodes are connected by an Azure virtual network. They are all
part of a single cloud service, operating in a load-balanced set. I am not
currently using any kind of FQDN, so the unicast host names are
"es-machine-1", "es-machine-2", etc., with no domain suffix whatsoever. As far as
I know, that bypasses the public load balancer (since none of those
hostnames are accessible to machines outside the virtual
network). But I've been wrong before :slight_smile: I actually can't find any kind of
fully qualified domain name for those machines, other than the public-facing
cloudapp.net one, so I assume this is OK? I've also tried using the
internal virtual network IP addresses on a similarly specced development
cluster, and I see the same timeouts there.
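
One way to check whether the network between the nodes is dropping connections, independent of Elasticsearch, is to probe the transport port from each node at regular intervals and log the failures. The following is a minimal sketch, not a definitive tool: it assumes the default transport port 9300, the function names are hypothetical, and the third host name in the usage comment is a guess based on the "es-machine-N" pattern above.

```python
import socket
import time


def probe(host, port, timeout_s=5.0):
    """Try one TCP connection; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers DNS failures, refusals and timeouts
        return False, time.monotonic() - start, str(exc)


def run_probe_loop(hosts, port=9300, rounds=10, interval_s=30.0, timeout_s=5.0):
    """Probe every host once per round and print one line per attempt."""
    for round_no in range(rounds):
        for host in hosts:
            ok, elapsed, err = probe(host, port, timeout_s)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            status = "ok" if ok else f"FAILED ({err})"
            print(f"{stamp} {host}:{port} {elapsed:.3f}s {status}")
        if round_no < rounds - 1:
            time.sleep(interval_s)


# Example call (host names are assumptions; run it from each node in turn):
# run_probe_loop(["es-machine-1", "es-machine-2", "es-machine-3"])
```

If the failures logged here line up in time with the disconnects in the Elasticsearch logs, that would suggest a networking problem rather than a resource problem on the nodes.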

On Friday, May 30, 2014 1:40:47 AM UTC-5, Michael Delaney wrote:

Are you using internal fully qualified domain names, e.g.
es01.myelasticsearcservice.f3.internal.net?
If you use public load balancer endpoints you'll get timeouts.


