We have a logs ES cluster: 20 big nodes (i2.8xlarge, 120 GB heap, yeah, I know) split across 2 EC2 availability zones, running 1.7.1 and cloud_aws.
Overnight, a node apparently went into a GC pause around 12:17 GMT, the master then timed out on a cluster.node.stats call to it, and when the node woke back up it said the master didn't recognize it (treating it as if the master had been lost).
This node had recently had another problem that caused nearly all of its shards to be recovered elsewhere, and it was being rebalanced back in, so to speak. Its GC times as logged in Datadog did not look unusual (0 for old-collection time, ~1 s for young-collection time), so GC may be a red herring, but the timing coincided.
Is there any setting I can change to reduce the likelihood of this happening? Would a later version of 1.7 help?
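The only knobs I've spotted so far are the zen fault-detection timeouts. Would bumping something like this in elasticsearch.yml make the master more tolerant of a long pause, or is there a better lever? (Values below are guesses, not tuned recommendations; defaults, as I understand them, are 1s / 30s / 3 retries.)

```yaml
# Master <-> node fault-detection pings; loosen them so a paused node
# isn't dropped quite as quickly. Illustrative values only.
discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 6
```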
Hmm, well, it's been working better for us than 31 or 30.5 GB heaps... those wind up crashing within an hour or two. We index about 32 MB/s constantly, into 10 shards on 20 nodes. About a TB a day for the primaries, 3G of docs.
When you tried running with a 30 GB heap, did you run multiple (3-4) nodes per host and use shard allocation awareness to distribute your shards across hosts?
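If not, it might be worth a try. A rough sketch of the elasticsearch.yml pieces involved (the `box_id` attribute name and `host-01` value are just examples):

```yaml
# Per-node config when running several smaller-heap nodes on one machine.
# "box_id" is an arbitrary custom attribute; use the same value for every
# node on a given physical host.
node.box_id: host-01

# Treat box_id as an awareness dimension so copies of a shard are spread
# across physical hosts, not just across node processes.
cluster.routing.allocation.awareness.attributes: box_id

# Also refuse to place two copies of the same shard on the same host.
cluster.routing.allocation.same_shard.host: true
```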