Long GC on one node seems to block writes to other nodes

ES version: 6.2

Short summary: When a data node responds slowly to cluster state update requests from the master node (due to long GC pauses on that data node), indexing of new data to the other nodes also seems to stop. I want to understand why one unresponsive node appears to block indexing on the rest of the cluster.

More background:

The ES nodes that we index new data onto have 32 GB of memory, and we set ES's heap to half of that, 16 GB, in accordance with this doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
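For reference, the heap is set in jvm.options; ours looks roughly like this (quoting from memory, so treat the exact lines as approximate):

# initial and max heap set to the same value, half of the 32 GB of RAM
-Xms16g
-Xmx16g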

We recently ran into a problem where a node was GCing constantly and failing to free any memory from the old generation. We concluded this from logs like these two on the node in question:

[gc][old][99376][1698] duration [13.1s], collections [1]/[13.4s], total [13.1s]/[10m], memory [14.4gb]->[14.4gb]/[15.1gb], all_pools {[young] [276.4mb]->[294.3mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.1gb]->[14.1gb]/[14.1gb]}

[gc][99375] overhead, spent [11.9s] collecting in the last [12.5s]
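Aside from the GC logs, we sanity-check the heap pressure with the nodes stats API; something along these lines (the host/port is just a placeholder for wherever you can reach the cluster):

# per-node JVM stats, including heap used and GC collector counts/times
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'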

This GC problem triggered a second problem: because the node was spending most of its time in GC, it was slow to respond to things like cluster state updates. We started seeing logs like this on the master node:

timed out waiting for all nodes to process published state [4825852] (timeout [30s], pending nodes: [{prod-es-data-hot-080c}{k-ciTBhaRja47SJJP725Kw}{BmJo7PjqTZmS_77P19JcSw}{10.4.0.156}{10.4.0.156:9300}{aws_availability_zone=us-west-2a, data_type=hot}])
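We haven't overridden any of the zen discovery timeouts, so I assume the 30s in that log is the default discovery.zen.publish_timeout. If we wanted to tune it, I believe it would go in elasticsearch.yml along these lines (not something we've actually done):

# how long the master waits for nodes to apply a published cluster state
discovery.zen.publish_timeout: 30s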

And because the node in question was not acknowledging cluster state updates, it appeared that we could not index any new data at all. Why would this be the case? I would expect that, even if we can't send writes to the node stuck in GC, we should still be able to write data to the other nodes.
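In case it helps with diagnosis, the next time this happens we plan to capture the master's queued cluster state updates while indexing is stalled, with something along the lines of (host/port again a placeholder):

# cluster state update tasks waiting to be processed on the master
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'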
