Thanks. Focussing on this data node, the master reports this kind of join-then-leave loop over and over again:
[2019-08-07T02:00:58,410][INFO ][o.e.c.s.MasterService ] [elastic-master-prod-us-central1-kkmd] node-join[{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold} join existing leader], term: 9, version: 1654, reason: added {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}
[2019-08-07T02:01:28,422][INFO ][o.e.c.s.ClusterApplierService] [elastic-master-prod-us-central1-kkmd] added {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}, term: 9, version: 1654, reason: Publication{term=9, version=1654}
[2019-08-07T02:01:28,427][WARN ][o.e.c.s.MasterService ] [elastic-master-prod-us-central1-kkmd] cluster state update task [node-join[{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold} join existing leader]] took [30s] which is above the warn threshold of 30s
[2019-08-07T02:02:58,573][INFO ][o.e.c.s.MasterService ] [elastic-master-prod-us-central1-kkmd] node-left[{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold} lagging], term: 9, version: 1658, reason: removed {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}
[2019-08-07T02:02:58,928][INFO ][o.e.c.s.ClusterApplierService] [elastic-master-prod-us-central1-kkmd] removed {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}, term: 9, version: 1658, reason: Publication{term=9, version=1658}
Note that the reason for removal is the well-hidden word lagging, indicating that this node failed to apply a cluster state update within the 2-minute lag timeout. In 7.4.0 this is logged more noisily, at WARN level.
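(As an aside, and quoting from memory so treat the setting name as an assumption: the lag timeout is controlled by cluster.follower_lag.timeout, set in elasticsearch.yml on the master-eligible nodes. You could raise it, something like the line below, but that only hides the symptom; the slow applications on the node itself are the real problem.)

# elasticsearch.yml on the master-eligible nodes -- illustrative only, not a recommendation
cluster.follower_lag.timeout: 5m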
Looking at the node's own logs, we see some egregiously slow cluster state applications:
[2019-08-07T02:11:27,946][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1631, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17.9m] which is above the warn threshold of 30s
[2019-08-07T02:12:39,294][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1687, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [52.4s] which is above the warn threshold of 30s
[2019-08-07T02:13:52,351][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1693, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [57.3s] which is above the warn threshold of 30s
[2019-08-07T02:32:01,205][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1705, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17m] which is above the warn threshold of 30s
[2019-08-07T02:49:20,181][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1756, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17.2m] which is above the warn threshold of 30s
[2019-08-07T03:06:23,071][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1801, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [16.9m] which is above the warn threshold of 30s
[2019-08-07T03:24:05,209][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1844, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17.6m] which is above the warn threshold of 30s
[2019-08-07T03:40:05,324][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1889, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [15.8m] which is above the warn threshold of 30s
Taking ≥ 15 minutes to apply a cluster state is very bad. Setting logger.org.elasticsearch.cluster.service: TRACE on the data node will tell us which applier is so slow. The default logging here is also improved in 7.4.
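If it helps, this is roughly how I'd turn that logger on without a restart, via the cluster settings API (it raises the level on every node, but it's only this node's log we need to read):

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.service": "TRACE"
  }
}

Once a slow application or two has been captured, set the same key back to null to return to the default level, since TRACE here is fairly chatty.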