Thanks. Focussing on this data node, the master reports this kind of join-then-leave loop over and over again:
[2019-08-07T02:00:58,410][INFO ][o.e.c.s.MasterService ] [elastic-master-prod-us-central1-kkmd] node-join[{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold} join existing leader], term: 9, version: 1654, reason: added {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}
[2019-08-07T02:01:28,422][INFO ][o.e.c.s.ClusterApplierService] [elastic-master-prod-us-central1-kkmd] added {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}, term: 9, version: 1654, reason: Publication{term=9, version=1654}
[2019-08-07T02:01:28,427][WARN ][o.e.c.s.MasterService ] [elastic-master-prod-us-central1-kkmd] cluster state update task [node-join[{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold} join existing leader]] took [30s] which is above the warn threshold of 30s
[2019-08-07T02:02:58,573][INFO ][o.e.c.s.MasterService ] [elastic-master-prod-us-central1-kkmd] node-left[{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold} lagging], term: 9, version: 1658, reason: removed {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}
[2019-08-07T02:02:58,928][INFO ][o.e.c.s.ClusterApplierService] [elastic-master-prod-us-central1-kkmd] removed {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}, term: 9, version: 1658, reason: Publication{term=9, version=1658}
Note that the reason for removal is the well-hidden word lagging, indicating that this node failed to apply a cluster state update within the 2-minute lag timeout. In 7.4.0 this is logged more noisily, at WARN level.
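(As an aside, and quoting from memory so treat the setting name as an assumption: the lag timeout is controlled by cluster.follower_lag.timeout, set in elasticsearch.yml on the master-eligible nodes. You could raise it, something like the line below, but that only hides the symptom; the slow applications on the node itself are the real problem.)

# elasticsearch.yml on the master-eligible nodes -- illustrative only, not a recommendation
cluster.follower_lag.timeout: 5m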
Looking at the node's own logs, we see some egregiously slow cluster state applications:
[2019-08-07T02:11:27,946][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1631, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17.9m] which is above the warn threshold of 30s
[2019-08-07T02:12:39,294][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1687, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [52.4s] which is above the warn threshold of 30s
[2019-08-07T02:13:52,351][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1693, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [57.3s] which is above the warn threshold of 30s
[2019-08-07T02:32:01,205][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1705, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17m] which is above the warn threshold of 30s
[2019-08-07T02:49:20,181][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1756, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17.2m] which is above the warn threshold of 30s
[2019-08-07T03:06:23,071][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1801, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [16.9m] which is above the warn threshold of 30s
[2019-08-07T03:24:05,209][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1844, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17.6m] which is above the warn threshold of 30s
[2019-08-07T03:40:05,324][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1889, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [15.8m] which is above the warn threshold of 30s
Taking ≥ 15 minutes to apply a cluster state is very bad. Setting logger.org.elasticsearch.cluster.service: TRACE on the data node will tell us which applier is so slow. The default logging here is also improved in 7.4.
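If it helps, this is roughly how I'd turn that logger on without a restart, via the cluster settings API (it raises the level on every node, but it's only this node's log we need to read):

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.service": "TRACE"
  }
}

Once a slow application or two has been captured, set the same key back to null to return to the default level, since TRACE here is fairly chatty.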