Node "timeout" possibly due to GC?

We have a logs ES cluster: 20 big nodes (i2.8xlarge, 120 GB heap, yeah, I know), split across 2 EC2 availability zones, running 1.7.1 with the cloud_aws plugin.

Overnight, a node apparently went into a GC around 12:17 GMT, then the master timed out on a cluster.node.stats call to it, and when the node woke back up, it reported that the master didn't recognize it (and treated it as though the master had been lost).

This node had recently had another problem that caused nearly all of its shards to be recovered elsewhere, and it was being rebalanced back in, so to speak. Its GC times as logged in Datadog did not look unusual (0 for old collection time, 1s for young collection time), so maybe GC is a red herring, but the timing did coincide.

Is there any setting I can change to reduce the likelihood of this happening? Would a later version of 1.7 help?
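For what it's worth, I assume the relevant knobs are the zen fault-detection settings in elasticsearch.yml; a sketch of what I had in mind tuning is below (the values shown are only illustrative, not what we currently run):

```yaml
# Zen fault-detection settings (elasticsearch.yml) that control how the master
# decides a node has dropped out. Values here are illustrative only.
discovery.zen.fd.ping_interval: 1s   # how often the master pings each node (default 1s)
discovery.zen.fd.ping_timeout: 60s   # how long to wait for each ping (default 30s)
discovery.zen.fd.ping_retries: 6     # failed pings before the node is dropped (default 3)
```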

Yes, don't run with a 120 GB heap. It's big, it's waiting to be filled with garbage, and it's going to take a long time to collect.

As for a later version of 1.7: unfortunately, no.

Hmm, well, it's been working better for us than 31 GB or 30.5 GB heaps... those wound up crashing within an hour or two. We index about 32 MB/s constantly, into 10 shards on 20 nodes: about a TB a day for the primaries, roughly 3 billion docs.

Does 2.x help with this?

When you tried running with a 30 GB heap, did you run multiple (3-4) nodes per host and use shard allocation awareness to distribute your shards across hosts?
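Roughly, the idea in elasticsearch.yml would be something like the sketch below (the box_id attribute name is just an example I made up, not something ES predefines):

```yaml
# Per-node config for running several smaller-heap nodes on one physical host.
# "box_id" is an arbitrary custom attribute; give it the same value for every
# node running on a given host.
node.box_id: host-01

# Tell the allocator to balance copies of each shard across box_id values.
cluster.routing.allocation.awareness.attributes: box_id

# Belt and braces: never put two copies of the same shard on the same host/IP.
cluster.routing.allocation.same_shard.host: true
```

The point is that several ~30 GB heaps each collect far faster than one 120 GB heap, while awareness keeps a single host failure from taking out every copy of a shard.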

That strategy did not occur to us. Thank you for mentioning it! Will have to give it a go.

Do you know a rule of thumb for how much indexing per node (in MB/s) is reasonable?