We have a logs ES cluster: 20 big nodes (i2.8xlarge, 120 GB heap, yeah, I know) split across 2 EC2 availability zones, running 1.7.1 and cloud_aws.
Overnight, a node apparently went into a GC pause around 12:17 GMT, the master then timed out on a cluster.node.stats call to it, and when the node woke back up it said the master didn't recognize it (treating it as if the master had been lost).
This node had recently had another problem that caused nearly all of its shards to be recovered elsewhere, and it was being rebalanced back in, so to speak. Its GC times as logged in Datadog did not look unusual (0 for old-collection time, ~1 s for young-collection time), so GC may be a red herring, but the timing coincided.
Is there any setting I can change to reduce the likelihood of this happening? Would a later version of 1.7 help?
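The only knobs I've spotted so far are the zen fault-detection timeouts. Would bumping something like this in elasticsearch.yml make the master more tolerant of a long pause, or is there a better lever? (Values below are guesses, not tuned recommendations; defaults, as I understand them, are 1s / 30s / 3 retries.)

```yaml
# Master <-> node fault-detection pings; loosen them so a paused node
# isn't dropped quite as quickly. Illustrative values only.
discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 6
```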
Hmm, well, it's been working better for us than 31 or 30.5 GB heaps... those wind up crashing within an hour or two. We index about 32 MB/s constantly, into 10 shards on 20 nodes. About a TB a day for the primaries, 3G of docs.
When you tried running with a 30 GB heap, did you run multiple (3-4) nodes per host and use shard allocation awareness to distribute your shards across hosts?
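If not, it might be worth a try. A rough sketch of the elasticsearch.yml pieces involved (the `box_id` attribute name and `host-01` value are just examples):

```yaml
# Per-node config when running several smaller-heap nodes on one machine.
# "box_id" is an arbitrary custom attribute; use the same value for every
# node on a given physical host.
node.box_id: host-01

# Treat box_id as an awareness dimension so copies of a shard are spread
# across physical hosts, not just across node processes.
cluster.routing.allocation.awareness.attributes: box_id

# Also refuse to place two copies of the same shard on the same host.
cluster.routing.allocation.same_shard.host: true
```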