I have a recurring issue where nodes randomly fall out of the cluster after a
while (usually under steady indexing load). We have 16 nodes in the cluster,
and 1 of them is a router node. All nodes are running CentOS with 1 GB
of RAM. I've allocated 768 MB of RAM to ES. Nothing else is running on the
nodes apart from a minimal CentOS install. My settings include:
Java: 1.7.0_09
VM name: Java HotSpot(TM) 64-Bit Server VM
VM vendor: Oracle Corporation
VM version: 23.5-b02
Elasticsearch: 0.19.11
When these nodes fall out of the cluster, there is nothing obvious in the
log, and ES is still up and running. Once I bounce ES on those nodes, they
typically rejoin the cluster.
Any help or ideas of how to troubleshoot would be really appreciated.
I would assume that there's either a networking issue or there's too much
load on ES for the node that falls out. You can try to rule out the
networking issue by running ping (as in "ping command") on a loop between
the master node and all the other nodes.
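A minimal sketch of such a ping loop, in case it helps. It assumes Linux ping (the -c and -W flags) and Python 3 on the master node; the addresses in NODES and the helper names are made up for illustration, so substitute your own. It only logs failures with a timestamp, so you can line them up with the time a node drops out.

```python
import subprocess
import time
from datetime import datetime

# Hypothetical node addresses; replace with the real ones in your cluster.
NODES = ["10.0.0.11", "10.0.0.12"]


def build_ping_command(host, timeout_s=2):
    """Return the argv for one Linux ping with a reply timeout in seconds."""
    return ["ping", "-c", "1", "-W", str(timeout_s), host]


def ping_once(host, timeout_s=2):
    """Return True if the host answered a single ICMP echo request."""
    result = subprocess.run(
        build_ping_command(host, timeout_s),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def format_failure(host, when):
    """Build a log line that can be matched against the ES logs by time."""
    return "{} ping to {} failed".format(when.isoformat(), host)


def watch(nodes, interval_s=1.0, rounds=None):
    """Ping every node once per round; rounds=None means loop until stopped."""
    done = 0
    while rounds is None or done < rounds:
        for host in nodes:
            if not ping_once(host):
                print(format_failure(host, datetime.now()), flush=True)
        time.sleep(interval_s)
        done += 1
```

Run watch(NODES) on the master and leave it going until the next node falls out. If nothing is logged around that time, the network is probably not the culprit.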
If system ping works fine during the time your node falls out, then I think
you should monitor your ES cluster to see what happens when the issue
reappears. Maybe it's GC or something else that's making the node
unresponsive for 3 retries (default) at a 3s interval (your configuration).
I suggest you look at our SPM for Elasticsearch:
You can also try increasing the number of failed pings allowed before the
node is kicked out by changing discovery.zen.fd.ping_retries.
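For example, something like this in elasticsearch.yml on each node. The value 6 is only an illustration, not a recommendation, and the commented-out timeout line is an assumption about what you might also want to tune; a higher retry count means a genuinely dead node takes longer to be detected.

```yaml
# Fault detection: failed pings tolerated before a node is dropped
# (the default is 3; 6 is an arbitrary example value).
discovery.zen.fd.ping_retries: 6

# Optionally, how long to wait for each ping reply (example value).
# discovery.zen.fd.ping_timeout: 30s
```

The nodes need a restart to pick up the change.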