After upgrading from ES 7, a node that leaves the cluster often cannot rejoin for a long time; restarting the problem node makes it rejoin immediately

Both 8.8.2 and 8.13.2 show this problem. The cluster data size is typically 300-700 TB.

The log gets stuck on "waiting for local cluster applier", sometimes for one or two minutes and sometimes for more than ten. When you restart the problem node, it rejoins the cluster almost immediately. This happens in several of our clusters, and it is always the HDD-backed cold nodes that are affected. They may be taking snapshots or serving high-cost searches at the time. I admit the IO pressure may be too great for HDDs, but this problem never occurred before we upgraded to ES 8 (I found the change that introduced the behavior: Block joins while applier is busy (#84919) · elastic/elasticsearch@c88dd10 · GitHub). At worst, queries are slowed down by the IO bottleneck, which is acceptable. However, a node leaving the cluster and not rejoining for a long time causes incomplete query results.

What should I do, or how can I reduce the IO pressure, given that the IO capacity of HDDs is not high? And when I restart the node, it rejoins the cluster immediately.

I adjusted the following setting:

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}

It seems to be working. At least there has been no node disconnection in the 12 hours since the adjustment.

Below is part of the log

[3] consecutive checks of the master node [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}] were unsuccessful ([3] rejected, [0] timed out), restarting discovery; more details may be available in the master node logs [last unsuccessful check: rejecting check since [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}] has been removed from the cluster]
master node changed {previous [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}], current []}, term: 66, version: 5607662, reason: becoming candidate: onLeaderFailure
master node changed {previous [], current [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}]}, term: 66, version: 5607667, reason: ApplyCommitRequest{term=66, version=5607667, sourceNode={client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}{ml.config_version=12.0.0, xpack.installed=true, transform.config_version=10.0.0}}
node-join[{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000} joining, removed [33.4s/33412ms] ago with reason [followers check retry count exceeded [timeouts=3, failures=0]], [9] total removals], term: 66, version: 5607667, delta: added {{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}}
after [10s] publication of cluster state version [5607667] is still waiting for {data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}{ml.config_version=12.0.0, zone=stale, transform.config_version=10.0.0, xpack.installed=true} [SENT_APPLY_COMMIT]
added {{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}}, term: 66, version: 5607667, reason: Publication{term=66, version=5607667}
node-join: [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}] with reason [joining, removed [33.4s/33412ms] ago with reason [followers check retry count exceeded [timeouts=3, failures=0]], [9] total removals]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.13/troubleshooting-unstable-cluster.html
node [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}{ml.config_version=12.0.0, zone=stale, transform.config_version=10.0.0, xpack.installed=true}] is lagging at cluster state version [0], although publication of cluster state version [5607667] completed [1.5m] ago
master not discovered yet: have discovered [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}, {client-01}{OUwpT0qyTJOuRz7k4HxlKQ}{wr_8PPMTQbWSL7FZxfJ-pA}{client-01}{7.32.146.96}{7.32.146.96:9300}{mr}{8.13.2}{7000099-8503000}, {master-02}{Rh4InA0gQumqNu5w49DjHg}{C2YTWQgZQdG_ZRaT1Qkq1w}{master-02}{7.32.137.102}{7.32.137.102:9301}{m}{8.13.2}{7000099-8503000}, {master-01}{uyTujKeRRjugDIieqT2PEA}{FOlws3_PRrCGqqE7gpFTTw}{master-01}{7.32.137.110}{7.32.137.110:9304}{m}{8.13.2}{7000099-8503000}, {client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}, {master-03}{9PDHtkV0TPyz1kP_85v2Qw}{R6lAnctaQ5WARbVBBBoEeQ}{master-03}{7.32.137.103}{7.32.137.103:9300}{m}{8.13.2}{7000099-8503000}] who claim current master to be [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}]; discovery will continue using [7.32.146.96:9300, 7.32.146.116:9300] from hosts providers and [{client-01}{OUwpT0qyTJOuRz7k4HxlKQ}{wr_8PPMTQbWSL7FZxfJ-pA}{client-01}{7.32.146.96}{7.32.146.96:9300}{mr}{8.13.2}{7000099-8503000}, {master-02}{Rh4InA0gQumqNu5w49DjHg}{C2YTWQgZQdG_ZRaT1Qkq1w}{master-02}{7.32.137.102}{7.32.137.102:9301}{m}{8.13.2}{7000099-8503000}, {client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}, {master-03}{9PDHtkV0TPyz1kP_85v2Qw}{R6lAnctaQ5WARbVBBBoEeQ}{master-03}{7.32.137.103}{7.32.137.103:9300}{m}{8.13.2}{7000099-8503000}, {master-01}{uyTujKeRRjugDIieqT2PEA}{FOlws3_PRrCGqqE7gpFTTw}{master-01}{7.32.137.110}{7.32.137.110:9304}{m}{8.13.2}{7000099-8503000}] from last-known cluster state; node term 66, last-accepted version 5607670 in term 66; joining [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}] in term [66] has status [waiting for local cluster applier] after [1.1m/70024ms]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.13/discovery-troubleshooting.html

I checked the document Troubleshooting discovery | Elasticsearch Guide [8.13] | Elastic and learned that the only way to troubleshoot further is with jstack. I don't see any more useful information in the existing ES logs.
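For reference, a minimal sketch of how to capture such a thread dump while a node is stuck (the paths, the elasticsearch user, and the main-class name used for the PID lookup are assumptions for a typical package install; adjust them to your own layout):

# Find the server JVM's PID (adjust the pattern if your version/launcher differs,
# or read it from your pidfile instead).
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)

# jstack must run as the same user as the JVM; the JDK bundled with Elasticsearch
# should include it under the jdk/bin directory of the install.
sudo -u elasticsearch /usr/share/elasticsearch/jdk/bin/jstack "$ES_PID" > "/tmp/jstack-$(hostname)-$(date +%s).txt"

# The interesting thread is the cluster applier:
grep -A 40 'clusterApplierService#updateTask' /tmp/jstack-*.txt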

That's right, please share a jstack from a node when it's in this state. It's not possible to answer your questions without that.

I don't think the problem is caused by that PR. As mentioned in the PR description:

If the application takes long enough then
there may be time for the node to rejoin the cluster, time out applying
the state resulting from the rejoin, and be removed again. This can
happen repeatedly and is disruptive to the rest of the cluster.

In other words, the node would fail to join until the applier thread becomes free. So the question is why the applier thread is not free. That's why we need to see those stack dumps.

please see stackdump · GitHub

"elasticsearch[data-35-stale][clusterApplierService#updateTask][T#1]" #59 [460] daemon prio=5 os_prio=0 cpu=1796297.22ms elapsed=348952.88s tid=0x00007fa93bf86750 nid=460 runnable  [0x00007fa8dd4a1000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.UnixFileDispatcherImpl.pread0(java.base@20.0.1/Native Method)
	at sun.nio.ch.UnixFileDispatcherImpl.pread(java.base@20.0.1/UnixFileDispatcherImpl.java:57)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(java.base@20.0.1/IOUtil.java:338)
	at sun.nio.ch.IOUtil.read(java.base@20.0.1/IOUtil.java:306)
	at sun.nio.ch.IOUtil.read(java.base@20.0.1/IOUtil.java:283)
	at sun.nio.ch.FileChannelImpl.readInternal(java.base@20.0.1/FileChannelImpl.java:986)
	at sun.nio.ch.FileChannelImpl.read(java.base@20.0.1/FileChannelImpl.java:969)
	at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(org.apache.lucene.core@9.6.0/NIOFSDirectory.java:179)
	at org.apache.lucene.store.BufferedIndexInput.refill(org.apache.lucene.core@9.6.0/BufferedIndexInput.java:291)
	at org.apache.lucene.store.BufferedIndexInput.readByte(org.apache.lucene.core@9.6.0/BufferedIndexInput.java:55)
	at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(org.apache.lucene.core@9.6.0/BufferedChecksumIndexInput.java:39)
	at org.apache.lucene.codecs.CodecUtil.readBEInt(org.apache.lucene.core@9.6.0/CodecUtil.java:667)
	at org.apache.lucene.codecs.CodecUtil.checkHeader(org.apache.lucene.core@9.6.0/CodecUtil.java:184)
	at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(org.apache.lucene.core@9.6.0/CodecUtil.java:253)
	at org.apache.lucene.codecs.lucene90.Lucene90SegmentInfoFormat.read(org.apache.lucene.core@9.6.0/Lucene90SegmentInfoFormat.java:106)
	at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(org.apache.lucene.core@9.6.0/SegmentInfos.java:407)
	at org.apache.lucene.index.SegmentInfos.readCommit(org.apache.lucene.core@9.6.0/SegmentInfos.java:364)
	at org.apache.lucene.index.SegmentInfos.readCommit(org.apache.lucene.core@9.6.0/SegmentInfos.java:300)
	at org.apache.lucene.index.SegmentInfos.readCommit(org.apache.lucene.core@9.6.0/SegmentInfos.java:289)
	at org.elasticsearch.index.engine.CombinedDeletionPolicy.getDocCountOfCommit(org.elasticsearch.server@8.8.2/CombinedDeletionPolicy.java:180)
	at org.elasticsearch.index.engine.CombinedDeletionPolicy.onCommit(org.elasticsearch.server@8.8.2/CombinedDeletionPolicy.java:102)
	at org.apache.lucene.index.IndexFileDeleter.checkpoint(org.apache.lucene.core@9.6.0/IndexFileDeleter.java:582)
	at org.apache.lucene.index.IndexWriter.finishCommit(org.apache.lucene.core@9.6.0/IndexWriter.java:4097)
	- locked <0x0000001045133848> (a org.apache.lucene.index.IndexWriter)
	at org.apache.lucene.index.IndexWriter.commitInternal(org.apache.lucene.core@9.6.0/IndexWriter.java:4051)
	- locked <0x000000104549a508> (a java.lang.Object)
	at org.apache.lucene.index.IndexWriter.commit(org.apache.lucene.core@9.6.0/IndexWriter.java:4005)
	at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(org.elasticsearch.server@8.8.2/InternalEngine.java:2725)
	at org.elasticsearch.index.engine.InternalEngine.flush(org.elasticsearch.server@8.8.2/InternalEngine.java:2068)
	at org.elasticsearch.index.engine.Engine.flush(org.elasticsearch.server@8.8.2/Engine.java:1061)
	at org.elasticsearch.index.engine.Engine.flushAndClose(org.elasticsearch.server@8.8.2/Engine.java:1763)
	at org.elasticsearch.index.shard.IndexShard.close(org.elasticsearch.server@8.8.2/IndexShard.java:1640)
	- locked <0x0000001045132380> (a java.lang.Object)
	at org.elasticsearch.index.IndexService.closeShard(org.elasticsearch.server@8.8.2/IndexService.java:579)
	at org.elasticsearch.index.IndexService.removeShard(org.elasticsearch.server@8.8.2/IndexService.java:559)
	- locked <0x000000104509ff80> (a org.elasticsearch.index.IndexService)
	at org.elasticsearch.index.IndexService.close(org.elasticsearch.server@8.8.2/IndexService.java:354)
	- locked <0x000000104509ff80> (a org.elasticsearch.index.IndexService)
	at org.elasticsearch.indices.IndicesService.removeIndex(org.elasticsearch.server@8.8.2/IndicesService.java:899)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.removeIndicesAndShards(org.elasticsearch.server@8.8.2/IndicesClusterStateService.java:408)

This indicates that ES is waiting on IO while closing a shard in order to move it to a different node. There is an open issue tracking this (#89821), but it's been a problem for a very long time; it's definitely not new in ES 8.

The stack trace indicates that you're writing to indices on cold nodes. The simplest solution is not to do that. Spinning disks are dreadfully slow at the best of times. Make sure that all your writeable indices are on nodes with better-performing disks.

My cold nodes hold only historical indices that have already rolled over, so this doesn't match my expectations. Can I find out which index is causing it? Or can I avoid this by setting the indices to block writes?

We close an index before deleting it, to avoid accidentally deleting the wrong index; that close step happens first, during a time window that I control.

When data migrates from a hot node to a cold node, or rebalances between cold nodes, does that count as an index write?

If you are not expecting older indices to be updated or to have documents deleted, I would recommend you set them to read-only as part of your ILM policy.
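For example, a minimal sketch of a policy with the readonly action in the cold phase (the policy name, the 7d timing, and the EUSER/EPASS/EHOST/EPORT variables are placeholders; in practice you would add the action to your existing policy rather than create a new one):

# Add a readonly action to the cold phase of an ILM policy (names/timings are examples).
curl -s -k -u "${EUSER}:${EPASS}" -X PUT \
  "https://${EHOST}:${EPORT}/_ilm/policy/example-readonly-policy" \
  -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "cold": {
        "min_age": "7d",
        "actions": {
          "readonly": {}
        }
      }
    }
  }
}'

The readonly action adds the index.blocks.write block when an index enters that phase, so searches keep working but indexing and deletes are rejected.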

Do you have any processes or tasks that use update-by-query or delete-by-query against these types of indices?

You mean find which index is being written to, when you don't expect it to be written to?

There are the stats in Stack Management for a start, but checking those one index at a time doesn't scale too well.

If new docs are being added (or deleted), just dump /_cat/indices, wait a bit, dump it again, and see which indices are now bigger or smaller (in #docs).

(One long line; you will likely want to adjust the sleep 10 to something bigger.)

sdiff -s <(sleep 10 && curl -s -k -u "${EUSER}":"${EPASS}"  "https://${EHOST}:${EPORT}/_cat/indices" | awk '{print $3,$7}' | sort)  <(curl -s -k -u "${EUSER}":"${EPASS}"  "https://${EHOST}:${EPORT}/_cat/indices" | awk '{print $3,$7}' | sort)

There are probably better ways, but a somewhat hacky approach would be to just look at the files, something like:

find /path/to/your/data -type f -exec stat -c "%y - %n" {} \; | sort -k 1,2

This might produce some misleading results, but you might be lucky.

I've set the ILM read-only action; checking the index settings, I can see it added {"index.blocks":{"write":"true"}}.
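As a cross-check, here is a rough one-liner to list any indices that still lack the write block (it assumes jq is installed and reuses the EUSER/EPASS/EHOST/EPORT variables from the sdiff one-liner above; hot indices that are still meant to be written to will of course show up as well):

# List indices without index.blocks.write=true (i.e. still writable).
curl -s -k -u "${EUSER}:${EPASS}" \
  "https://${EHOST}:${EPORT}/_all/_settings/index.blocks.write?flat_settings=true" \
  | jq -r 'to_entries[] | select(.value.settings["index.blocks.write"] != "true") | .key'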

Client applications only ever use the rollover alias; in other words, they do not update or write to the historical indices.

We take backup snapshots of historical indices, then close and delete them.

Thanks, I'll give it a try. It may take a while because of the Chinese New Year holiday.

Obviously, if you have now set all the suspect indices to read-only via ILM, then #docs should not change in those indices.

If no client, or anything else, were somehow causing IO activity on those same indices, then my understanding from this thread is that you wouldn't have had the problem in the first place.

I would expect you can by looking at GET _stats?level=shards, particularly .indices[].shards[][].indexing.index_total and .indices[].shards[][].indexing.delete_total. Unfortunately it'll take a bit of work because .indices[].shards[][].routing.node is the internal node ID and you'll need to look that up in GET _nodes/_all/none to work out which ones are cold nodes.
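If it helps, here is a rough jq sketch along those lines (it assumes jq, reuses the EUSER/EPASS/EHOST/EPORT variables from the earlier one-liner, and prints index, node name, index_total, and delete_total for every shard copy that has done any indexing; these counters are cumulative since the shard started on its current node, so you may want to compare two samples taken some time apart):

# Map internal node IDs to node names, then list shards with indexing/deletion activity.
NODES=$(curl -s -k -u "${EUSER}:${EPASS}" "https://${EHOST}:${EPORT}/_nodes/_all/none")
curl -s -k -u "${EUSER}:${EPASS}" "https://${EHOST}:${EPORT}/_stats?level=shards" \
  | jq --argjson nodes "$NODES" -r '
      .indices | to_entries[] | .key as $index
      | .value.shards | to_entries[] | .value[]
      | select(.indexing.index_total > 0 or .indexing.delete_total > 0)
      | [$index, ($nodes.nodes[.routing.node].name // .routing.node),
         .indexing.index_total, .indexing.delete_total]
      | @tsv'

You can then filter the node-name column down to your cold nodes.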

No, it's not

@zcola did you solve your mystery?

I will continue to investigate this issue when I'm back at work after the holidays. Thanks.