We have a logs ES cluster: 20 big nodes (i2.8xlarge, 120 GB heap, yeah, I know), split across 2 EC2 availability zones, running 1.7.1 with cloud_aws.
Overnight, a node apparently went into a GC pause around 12:17 GMT; the master then timed out on a cluster node-stats call to it, and when the node woke back up, it reported that the master didn't recognize it (and treated the master as lost).
This node had recently had another problem that caused nearly all of its shards to be recovered elsewhere, and it was being rebalanced back in, so to speak. Its GC times as logged in Datadog did not look unusual (0 for old-generation collection time, ~1s for young-generation collection time). Maybe GC was a red herring, but it was time-coincident with the drop.
Is there any setting I can change to reduce the likelihood of this happening? Would a later version of 1.7 help?
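For reference, the only knobs I've found so far are the zen fault-detection settings in elasticsearch.yml. The values below are my guess at making the master more tolerant of a long pause (the commented defaults are what I believe 1.x ships with; please correct me if I've misread the docs):

```yaml
# elasticsearch.yml -- zen fault-detection settings (ES 1.x).
# Believed defaults: ping_interval 1s, ping_timeout 30s, ping_retries 3.
# Raising timeout/retries should let a node survive a longer GC pause
# before the master drops it, at the cost of slower detection of
# genuinely dead nodes.
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 6
```

Would bumping these actually cover the node-stats timeout path too, or is that governed by a different timeout?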