Data node constantly dropping out of the cluster

We had a situation after migrating to Elasticsearch 7.3. We run 3 dedicated masters and 4 data nodes. Each data node has 24 vCPUs and 30GB of RAM, used primarily for ingesting logs. One of our data nodes was constantly being dropped from the cluster. It only happened to this single node. We are running in a Google Cloud Platform. We initially suspected some load issue that prevented the data nodes from acknowledging the state of the master or some networking issue, but we couldn't find any evidence of the latter and increasing the CPUs for the data nodes didn't help.

We think that we found the issue (or a sympthom), but its a bit puzzling and I would like to get some feedback/opinions if possible. Let me provide a bit of context:

After the migration to v7.3 one of our templates was not set up properly and an index (a big one) was created with a single shard (>200GB). This shard was allocated to the node that was constantly disconnecting from the cluster. After checking the logs we noticed the following:

    [2019-08-07T02:13:24,055][WARN ][o.e.i.e.Engine           ] [elastic-data4-prod-us-central1-55q5] [hsc-searchresultlogs-prod_2019.08.06][0] failed engine [exception during primary-replica resync]
    org.elasticsearch.transport.RemoteTransportException: [elastic-data4-prod-us-central1-55q5][10.248.24.143:9300][internal:index/seq_no/resync[p]]
    Caused by: org.elasticsearch.action.UnavailableShardsException: [hsc-searchresultlogs-prod_2019.08.06][0] Not enough active copies to meet shard count of [DEFAULT] (have 0, needed DEFAULT). Timeout: [1m], request: [TransportResyncReplicationAction.Request{shardId=[hsc-searchresultlogs-prod_2019.08.06][0], timeout=1m, index='hsc-searchresultlogs-prod_2019.08.06', trimAboveSeqNo=22351547, maxSeenAutoIdTimestampOnPrimary=1565137505204, ops=0}]
        at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:102) [elasticsearch-7.3.0.jar:7.3.0]
...
        at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2580) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:864) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:312) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:275) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:703) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0.jar:7.3.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
    [2019-08-07T02:13:24,088][WARN ][o.e.t.ThreadPool         ] [elastic-data4-prod-us-central1-55q5] failed to run scheduled task [org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker@20dd1994] on thread pool [same]
    org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
...
        at org.elasticsearch.indices.IndexingMemoryController.getShardWritingBytes(IndexingMemoryController.java:182) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.runUnlocked(IndexingMemoryController.java:310) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:290) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:225) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0.jar:7.3.0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
    [2019-08-07T02:13:29,090][WARN ][o.e.t.ThreadPool         ] [elastic-data4-prod-us-central1-55q5] failed to run scheduled task [org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker@20dd1994] on thread pool [same]
    org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:681) ~[lucene-core-

Seems like the translog of that shard was corrupted. After we deleted the index everything has been running stable.

Is this a plausible scenario? a single corrupted shard's translog causing the entire data node to get kicked from the cluster? We run normally with replication of 1 and 5 shards but still, it is a bit strange that the data node didn't OOM or anything it was just dropping out of the cluster :thinking:.

Anyhow, I would appreciate any input or comments on this.

I don't see how a corrupted translog would cause a data node to be removed from the cluster, but they could perhaps be symptoms of a common cause. The log messages you've shared aren't very informative - they are showing some things failing because a shard has already failed, but don't tell us anything about why the shard failed nor do they say anything about a node leaving the cluster.

Can you share all the logs from both the problematic node and the elected master? Use https://gist.github.com/ since there'll be too much to share in a post here.

First of all thanks for the quick reply!
I added the link to the full log and removed it by accident because I was over the 7k character limit :man_facepalming:t2:.

Here is the full log of the data node:
https://send.firefox.com/download/d999f520d6ba35e4/#H8Esm--U46aCGdZ8XXXCWA.

Logs from a master node with the following logging:

logger.org.elasticsearch.cluster.service: TRACE
logger.org.elasticsearch.gateway.MetaStateService: TRACE

enabled.

The master logs you shared are from elastic-master-prod-us-central1-p7fx which was not the elected master. It looks like the elected master was elastic-master-prod-us-central1-kkmd from 2019-08-07T01:37:53,036 onwards.

Yep, you're right. I have the logs for the elected master but without the TRACE enabled:

Thanks. Focussing on this data node, the master reports this kind of loop over and over again:

[2019-08-07T02:00:58,410][INFO ][o.e.c.s.MasterService    ] [elastic-master-prod-us-central1-kkmd] node-join[{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold} join existing leader], term: 9, version: 1654, reason: added {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}
[2019-08-07T02:01:28,422][INFO ][o.e.c.s.ClusterApplierService] [elastic-master-prod-us-central1-kkmd] added {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}, term: 9, version: 1654, reason: Publication{term=9, version=1654}
[2019-08-07T02:01:28,427][WARN ][o.e.c.s.MasterService    ] [elastic-master-prod-us-central1-kkmd] cluster state update task [node-join[{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold} join existing leader]] took [30s] which is above the warn threshold of 30s
[2019-08-07T02:02:58,573][INFO ][o.e.c.s.MasterService    ] [elastic-master-prod-us-central1-kkmd] node-left[{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold} lagging], term: 9, version: 1658, reason: removed {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}
[2019-08-07T02:02:58,928][INFO ][o.e.c.s.ClusterApplierService] [elastic-master-prod-us-central1-kkmd] removed {{elastic-data4-prod-us-central1-55q5}{wUq0jVSxTKydhut1TptECQ}{SIHJvzXmTjuoqeleoLiJ9A}{10.248.24.143}{10.248.24.143:9300}{d}{ml.machine_memory=31625756672, ml.max_open_jobs=20, xpack.installed=true, box_type=cold},}, term: 9, version: 1658, reason: Publication{term=9, version=1658}

Note that the reason for removal is the well-hidden word lagging indicating that this node failed to apply a cluster state update within the 2 minute lag timeout. In 7.4.0 this gets logged more noisily at WARN level.

Looking at the node's logs we see some egregiously slow cluster state applications:

[2019-08-07T02:11:27,946][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1631, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17.9m] which is above the warn threshold of 30s
[2019-08-07T02:12:39,294][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1687, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [52.4s] which is above the warn threshold of 30s
[2019-08-07T02:13:52,351][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1693, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [57.3s] which is above the warn threshold of 30s
[2019-08-07T02:32:01,205][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1705, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17m] which is above the warn threshold of 30s
[2019-08-07T02:49:20,181][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1756, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17.2m] which is above the warn threshold of 30s
[2019-08-07T03:06:23,071][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1801, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [16.9m] which is above the warn threshold of 30s
[2019-08-07T03:24:05,209][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1844, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [17.6m] which is above the warn threshold of 30s
[2019-08-07T03:40:05,324][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=1889, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [15.8m] which is above the warn threshold of 30s

Taking ≥ 15 minutes to apply a cluster state is very bad. logger.org.elasticsearch.cluster.service: TRACE on the data node will tell us which applier is so slow. The default logging here is also improved in 7.4.

I also see this message:

[2019-08-07T01:45:03,826][WARN ][o.e.i.e.Engine           ] [elastic-data4-prod-us-central1-55q5] [hsc-searchresultlogs-prod_2019.08.06][0] failed engine [failed to recover from translog]
org.elasticsearch.index.engine.EngineException: failed to recover from translog
	at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:468) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:438) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:125) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1480) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:433) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1752) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2406) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.0.jar:7.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[CLOSED] operation only allowed when recovering, origin [LOCAL_TRANSLOG_RECOVERY]
	at org.elasticsearch.index.shard.IndexShard.ensureWriteAllowed(IndexShard.java:1626) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:738) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1410) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.runTranslogRecovery(IndexShard.java:1441) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.lambda$openEngineAndRecoverFromTranslog$6(IndexShard.java:1476) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:466) ~[elasticsearch-7.3.0.jar:7.3.0]
	... 13 more

Is this what led you to believe that there was a translog corruption? There's no corruption here, this indicates that a primary shard was in the process of recovering from its local translog and then the master cancelled that recovery, probably because the node was kicked out of the cluster.

I see, any Idea about what could've caused the excessive delay in applying the cluster state? As far as we can see the node was completely idle (https://cl.ly/cfddb7). And this cluster only has ~150 indices (replication of 1 and 5 shards) which should not cause any issues with writing the state to disk (I think/hope).

The data node was also running with logger.org.elasticsearch.cluster.service: TRACE set in the elasticsearch.yml file. Not from the middle of the night but since [2019-08-07T09:28:27,994] (from the log).

From the CLOSED state of the hsc-searchresultlogs-prod_2019.08.06 index I think that the fragment is from where I manually tried to close the index (before removing it completely).

Oh great so it was. Here:

[2019-08-07T09:28:42,041][DEBUG][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] processing [ApplyCommitRequest{term=9, version=2907, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}]: execute
[2019-08-07T09:28:42,054][DEBUG][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state updated, version [2907], source [ApplyCommitRequest{term=9, version=2907, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}]
[2019-08-07T09:28:42,061][TRACE][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] connecting to nodes of cluster state with version 2907
[2019-08-07T09:28:42,062][DEBUG][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] applying settings from cluster state with version 2907
[2019-08-07T09:28:42,062][DEBUG][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] apply cluster state with version 2907
[2019-08-07T09:28:42,062][TRACE][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] calling [org.elasticsearch.indices.cluster.IndicesClusterStateService@41e749f6] with change to version [2907]
[2019-08-07T09:44:46,155][TRACE][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] calling [org.elasticsearch.script.ScriptService@31a46167] with change to version [2907]
[2019-08-07T09:44:46,155][TRACE][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] calling [org.elasticsearch.xpack.indexlifecycle.IndexLifecycleService@7859ffb8] with change to version [2907]
...
[2019-08-07T09:44:46,225][DEBUG][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] processing [ApplyCommitRequest{term=9, version=2907, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}]: took [16m] done applying updated cluster state (version: 2907, uuid: PgepKvg3QI2aU_5_9b0ZzQ)
[2019-08-07T09:44:46,225][WARN ][o.e.c.s.ClusterApplierService] [elastic-data4-prod-us-central1-55q5] cluster state applier task [ApplyCommitRequest{term=9, version=2907, sourceNode={elastic-master-prod-us-central1-kkmd}{CbcPaUgERYOlQWxNZwG0Ig}{-_eHWR10Qw6kH5AaahlWpw}{10.248.24.138}{10.248.24.138:9300}{m}{ml.machine_memory=7847473152, ml.max_open_jobs=20, xpack.installed=true, box_type=cold}}] took [16m] which is above the warn threshold of 30s

Most of that time is in the IndicesClusterStateService which makes me suspect #43699. Does this index use eager_global_ordinals? If so, I would suggest trying to disable that.

It seems that it was related to the size of the shard and not to transaction log. We don't use eager_global_ordinals at all in the cluster. But 43699 seems like a good candidate, this was indeed a huge shard. It is possible that we ran into the same issue even without using eager_global_ordinals?

The same index from previous days (some times) goes up to ~500GB (for the entire index):

  "docs": {
    "count": 28867644,
    "deleted": 0
  },
  "store": {
    "size": "508.9gb",
    "size_in_bytes": 546455381093
  },

The main difference is that in this case, all the data was in a single shard.

The best way to tell would be good to capture the output of GET _nodes/elastic-data4-prod-us-central1-55q5/hot_threads?threads=99999&ignore_idle_threads=false while it's running the applier from the IndicesClusterStateService so we can see what else it's busy doing.

Unfortunately now I cannot reproduce the exact issue since I deleted the index. After removing the index the node recovered and joined the cluster.

I could try to ingest/reindex all into a new node/index with a single shard and roughly the same amount of data >200GB and stopping starting that node and see how it behaves.

Thanks, yes, it'd be good to get to the bottom of it so we can be sure that it's covered by #43699 and is not a different issue.

Hello again @DavidTurner:

I've run into a similar situation today. One node went offline after triggering a normal deletion of old indices with curator. The node was showing the same symptoms (joining-leaving the cluster in a loop). I took the opportunity to take a look at the hot_threads while the node bootstrapped and here is the result:

The node is still "flapping". This data node has 32vCPU and 30GB of RAM, and it is mostly idle:

A couple more of hot_threads:

From a brief check it does look to be starting up engines slowly, perhaps due to slow IO, but not global ordinals. I'm not going to be available to look more for a couple of weeks. I'd recommend looking into the slow IO possibility some more, and if you can rule that out then open an issue on GitHub and link it back to this thread.

Would you say that faster I/O is required for the 7.x versions? We've been running this cluster in 6.x prior to the upgrade without major issues (at least not related to nodes dropping out of the cluster).

These nodes are running with normal pd-standard storage from GCP so I/O its not the fastest. Still, it feels strange that the nodes leave (or are kicked out of) the cluster due to a DELETE command or some actions over an initialising shard (as we suspect from this thread).

I don't think any of the slow bits here have changed significantly in version 7. However, versions before 7 would simply silently struggle on if nodes had such egregiously slow IO, which can make other failures much worse; v7 is stricter, to surface these issues earlier.