I have a 3 node cluster running ES 1.0.1 in Azure. They're Windows VMs
with 7 GB of RAM. The JVM heap size is allocated at 4 GB per node. There is
a single index in the cluster with 50 shards and 1 replica. The total
number of documents on primary shards is 29 million with a store size of
60 GB (including replicas).
Almost every day now I get a random node disconnecting from the cluster.
The usual suspect is a ping timeout. The longest GC in the logs is about 1
sec, and the boxes don't look resource constrained at all. CPU never
goes above 20%. The used JVM heap size never goes above 6 GB (the total on
the cluster is 12 GB) and the field data cache never gets over 1 GB. The
node that drops out is different every day. I have
discovery.zen.minimum_master_nodes set so there's not any kind of split brain
scenario, but there are times where the disconnected node NEVER rejoins
until I bounce the process.
Has anyone seen this before? Is it an Azure networking issue? How can I
tell? If it's resource problems, what's the best way for me to turn on
logging to diagnose them? What else can I tell you or what other steps can
I take to figure this out? It's really quite maddening.
I'm using the unicast list of nodes at the moment. I have multicast turned
off as well. I have not changed the default ping timeout or anything.
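For readers following along, here is a rough sketch of what that setup would look like as elasticsearch.yml-style settings. It is an illustration, not the poster's actual config: the host names passed in at the bottom are placeholders, the fault-detection values are the documented ES 1.x defaults (unchanged, as stated above), and the quorum calculation assumes all three nodes are master-eligible.

```python
# Sketch only: renders the discovery settings described in the post as
# elasticsearch.yml-style lines. Nothing here talks to a cluster.

def quorum(master_eligible_nodes):
    """Majority of master-eligible nodes, the usual minimum_master_nodes value."""
    return master_eligible_nodes // 2 + 1

def discovery_settings(hosts):
    """Build ES 1.x zen discovery settings for a unicast-only cluster."""
    host_list = "[" + ", ".join('"%s"' % h for h in hosts) + "]"
    return {
        # Multicast off, explicit unicast host list (as described in the post).
        "discovery.zen.ping.multicast.enabled": "false",
        "discovery.zen.ping.unicast.hosts": host_list,
        # Assumption: every listed host is master-eligible.
        "discovery.zen.minimum_master_nodes": str(quorum(len(hosts))),
        # Fault-detection defaults in ES 1.x, left unchanged per the post.
        "discovery.zen.fd.ping_interval": "1s",
        "discovery.zen.fd.ping_timeout": "30s",
        "discovery.zen.fd.ping_retries": "3",
    }

def render(settings):
    """Format the settings as one 'key: value' line each."""
    return "\n".join("%s: %s" % (k, v) for k, v in settings.items())

if __name__ == "__main__":
    # Placeholder host names, not taken from a real config file.
    print(render(discovery_settings(["node-a", "node-b", "node-c"])))
```

With three master-eligible nodes the quorum works out to 2, which is the value that prevents a single isolated node from electing itself master.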
On Thursday, May 29, 2014 7:37:38 PM UTC-5, David Pilato wrote:
Just checking: are you using azure cloud plugin or unicast list of nodes?
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
On May 30, 2014, at 02:12, Eric Brandes <eric.b...@gmail.com> wrote:
The three nodes are connected by an Azure virtual network. They are all
part of a single cloud service, operating in a load balanced set. I am not
currently using any kind of FQDN, so the unicast host names are
"es-machine-1", "es-machine-2" etc., with no domain suffix whatsoever. As
far as I know that bypasses the public load balancer (since none of those
hostnames are accessible to machines outside the virtual network). But
I've been wrong before. I actually can't find any kind of fully qualified
domain name for those machines, other than the public facing cloudapp.net
one, so I assume this is OK? I've also tried using the internal virtual
network IP addresses on a similarly specced development cluster, and I see
the same timeouts there.
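One way to start separating a networking problem from a node problem is to probe the transport port between the machines from outside Elasticsearch. The following is a minimal sketch, assuming the default transport port 9300 and the short host names mentioned above; `probe` and `probe_all` are hypothetical helper names, and a successful TCP connect only shows that name resolution and the port are reachable at that moment, not that the ES ping itself would succeed.

```python
import socket
import time

def probe(host, port=9300, timeout=3.0):
    """Try one TCP connection; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # OSError covers DNS failures, refused connections, and timeouts.
        return False, time.monotonic() - start, str(exc)

def probe_all(hosts, port=9300, timeout=3.0):
    """Probe each host once and collect the results by host name."""
    return {h: probe(h, port, timeout) for h in hosts}

if __name__ == "__main__":
    # Host names as given in the thread; adjust to the actual machines.
    for host, (ok, secs, err) in probe_all(["es-machine-1", "es-machine-2"]).items():
        print("%s ok=%s %.3fs %s" % (host, ok, secs, err))
```

Run on a schedule from each node and logged with timestamps, the results could be lined up against the disconnect times in the ES logs to see whether the failures coincide.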
On Friday, May 30, 2014 1:40:47 AM UTC-5, Michael Delaney wrote: