ES nodes fall out of cluster periodically

Hello,

I have a recurring issue where nodes randomly fall out of the cluster after
a while (usually under steady indexing load). We have 16 nodes in the
cluster, one of which is a router node. All nodes run CentOS with 1 GB of
RAM, of which I've allocated 768 MB to ES. Nothing else runs on the nodes
apart from a minimal CentOS install. My settings include:

node.name: "myname"
path:
  logs: /data/elasticsearch/log
  data: /data/elasticsearch/data
plugin.mandatory: mapper-attachments
multicast.enabled: false
http.enabled: false
cluster.routing.allocation.cluster_concurrent_rebalance: 4
cluster.routing.allocation.node_concurrent_recoveries: 4
discovery.zen.ping_timeout: 15s
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_interval: 3s
discovery.zen.ping.unicast.hosts: ["host1:9300","host2:9300".......]

Service wrapper:
wrapper.java.additional.1=-Delasticsearch-service
wrapper.java.additional.2=-Des.path.home=%ES_HOME%
wrapper.java.additional.3=-Xss256k
wrapper.java.additional.4=-XX:+UseParNewGC
wrapper.java.additional.5=-XX:+UseConcMarkSweepGC
wrapper.java.additional.6=-XX:CMSInitiatingOccupancyFraction=75
wrapper.java.additional.7=-XX:+UseCMSInitiatingOccupancyOnly
wrapper.java.additional.8=-XX:+HeapDumpOnOutOfMemoryError
wrapper.java.additional.9=-Djava.awt.headless=true
wrapper.java.additional.10=-XX:PermSize=128m
wrapper.java.additional.11=-XX:MaxPermSize=128m
wrapper.java.additional.12=-XX:+DisableExplicitGC
wrapper.java.additional.13=-XX:+UseCondCardMark
wrapper.java.additional.14=-XX:+CMSParallelRemarkEnabled

Java: 1.7.0_09
VM name: Java HotSpot(TM) 64-Bit Server VM
VM vendor: Oracle Corporation
VM version: 23.5-b02

Elasticsearch: 0.19.11

When these nodes fall out of the cluster, there is nothing obvious in the
log, and ES is still up and running on them. Once I bounce ES on those
nodes, they typically rejoin the cluster.

Any help or ideas of how to troubleshoot would be really appreciated.

--

Hello,

I would assume that there's either a networking issue or too much load on
ES for the node that falls out. You can try to rule out the networking
issue by running ping (as in "ping command") in a loop between the master
node and all the other nodes.
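A minimal sketch of such a ping loop, assuming Linux iputils ping (the -c and -W flags) and Python 3 on the master node. The names ping_once and monitor are made up for illustration, and the host names are the placeholders from the unicast list above:

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2, run=subprocess.run):
    """Send one ICMP echo with the system ping; return True on a reply."""
    # -c 1: send a single packet; -W: seconds to wait for the reply
    # (assumes Linux iputils ping; other platforms use different flags).
    result = run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(hosts, rounds, interval_s=3, ping=ping_once, sleep=time.sleep):
    """Ping every host once per round; return (timestamp, host) for each miss."""
    failures = []
    for i in range(rounds):
        for host in hosts:
            if not ping(host):
                stamp = datetime.now().isoformat(timespec="seconds")
                failures.append((stamp, host))
                print(f"{stamp} no reply from {host}")
        if i < rounds - 1:
            sleep(interval_s)  # pause between rounds, not after the last one
    return failures
```

Something like monitor(["host1", "host2"], rounds=1200) would then run for roughly an hour at the 3s interval. If the timestamps it prints line up with the moments a node drops out, the network is the likely suspect; if it stays silent, look at the node itself.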

If system ping works fine while your node falls out, then I think you
should monitor your ES cluster to see what happens when the issue
reappears. Maybe it's GC or something else that's making the node
unresponsive for 3 retries (default) at a 3s interval (your configuration).
I suggest you look at our SPM for Elasticsearch.

You can also try increasing the number of retries until the node is kicked
out by changing discovery.zen.fd.ping_retries.
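To get a feel for how much of a pause a node can survive, here is a rough back-of-the-envelope sketch. It assumes the node is dropped after ping_retries consecutive failed pings, and it brackets the result between pings that fail immediately and pings that each wait out a full timeout. The exact scheduling inside ES may differ, the function name is made up, and the 30s timeout is an assumed value, not one taken from the configuration above:

```python
def eviction_window(ping_interval_s, ping_retries, ping_timeout_s):
    """Rough (shortest, longest) seconds of unresponsiveness before removal.

    Assumption: removal happens after `ping_retries` consecutive failures.
    shortest: every ping fails at once, so only the interval matters.
    longest: every ping also waits its full timeout before failing.
    """
    shortest = ping_retries * ping_interval_s
    longest = ping_retries * (ping_interval_s + ping_timeout_s)
    return shortest, longest


# 3s interval from the posted config, 3 retries, assumed 30s ping timeout
print(eviction_window(3, 3, 30))  # (9, 99)
# Doubling the retries doubles both ends of the estimate
print(eviction_window(3, 6, 30))  # (18, 198)
```

On a 768 MB heap under steady indexing load, a stop-the-world GC pause of several seconds is plausible, so the lower figure is the one to compare against whatever pauses your monitoring shows.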

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Tue, Dec 11, 2012 at 5:20 PM, isg8 ilya@erudites.com wrote:


--