After upgrading from ES 7, a node that leaves the cluster often cannot rejoin for a long time; restarting the problem node makes it rejoin immediately

Both 8.8.2 and 8.13.2 show this problem. The cluster data size is typically 300-700 TB.

The log gets stuck on "waiting for local cluster applier", sometimes for one or two minutes and sometimes for more than ten. When you restart the problem node, it rejoins the cluster almost immediately. This happens in several of our clusters, and it is always the HDD-backed cold nodes that are affected. They may be taking snapshots or serving high-cost searches at the time. I admit the IO pressure may be too great for HDDs, but this problem never occurred before we upgraded to ES 8 (I found the change that introduced the behavior: Block joins while applier is busy (#84919) · elastic/elasticsearch@c88dd10 · GitHub). At worst, queries are slowed down by the IO bottleneck, which is acceptable. However, a node leaving the cluster and not rejoining for a long time causes incomplete query results.

What should I do, or how can I reduce the IO pressure, given that the IO capacity of HDDs is not high? And when I restart the node, it rejoins the cluster immediately.

I adjusted the following setting:

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}

It seems to be working. At least there has been no node disconnection in the 12 hours since the adjustment.

Below is part of the log

[3] consecutive checks of the master node [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}] were unsuccessful ([3] rejected, [0] timed out), restarting discovery; more details may be available in the master node logs [last unsuccessful check: rejecting check since [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}] has been removed from the cluster]
master node changed {previous [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}], current []}, term: 66, version: 5607662, reason: becoming candidate: onLeaderFailure
master node changed {previous [], current [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}]}, term: 66, version: 5607667, reason: ApplyCommitRequest{term=66, version=5607667, sourceNode={client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}{ml.config_version=12.0.0, xpack.installed=true, transform.config_version=10.0.0}}
node-join[{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000} joining, removed [33.4s/33412ms] ago with reason [followers check retry count exceeded [timeouts=3, failures=0]], [9] total removals], term: 66, version: 5607667, delta: added {{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}}
after [10s] publication of cluster state version [5607667] is still waiting for {data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}{ml.config_version=12.0.0, zone=stale, transform.config_version=10.0.0, xpack.installed=true} [SENT_APPLY_COMMIT]
added {{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}}, term: 66, version: 5607667, reason: Publication{term=66, version=5607667}
node-join: [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}] with reason [joining, removed [33.4s/33412ms] ago with reason [followers check retry count exceeded [timeouts=3, failures=0]], [9] total removals]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.13/troubleshooting-unstable-cluster.html
node [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}{ml.config_version=12.0.0, zone=stale, transform.config_version=10.0.0, xpack.installed=true}] is lagging at cluster state version [0], although publication of cluster state version [5607667] completed [1.5m] ago
master not discovered yet: have discovered [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}, {client-01}{OUwpT0qyTJOuRz7k4HxlKQ}{wr_8PPMTQbWSL7FZxfJ-pA}{client-01}{7.32.146.96}{7.32.146.96:9300}{mr}{8.13.2}{7000099-8503000}, {master-02}{Rh4InA0gQumqNu5w49DjHg}{C2YTWQgZQdG_ZRaT1Qkq1w}{master-02}{7.32.137.102}{7.32.137.102:9301}{m}{8.13.2}{7000099-8503000}, {master-01}{uyTujKeRRjugDIieqT2PEA}{FOlws3_PRrCGqqE7gpFTTw}{master-01}{7.32.137.110}{7.32.137.110:9304}{m}{8.13.2}{7000099-8503000}, {client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}, {master-03}{9PDHtkV0TPyz1kP_85v2Qw}{R6lAnctaQ5WARbVBBBoEeQ}{master-03}{7.32.137.103}{7.32.137.103:9300}{m}{8.13.2}{7000099-8503000}] who claim current master to be [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}]; discovery will continue using [7.32.146.96:9300, 7.32.146.116:9300] from hosts providers and [{client-01}{OUwpT0qyTJOuRz7k4HxlKQ}{wr_8PPMTQbWSL7FZxfJ-pA}{client-01}{7.32.146.96}{7.32.146.96:9300}{mr}{8.13.2}{7000099-8503000}, {master-02}{Rh4InA0gQumqNu5w49DjHg}{C2YTWQgZQdG_ZRaT1Qkq1w}{master-02}{7.32.137.102}{7.32.137.102:9301}{m}{8.13.2}{7000099-8503000}, {client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}, {master-03}{9PDHtkV0TPyz1kP_85v2Qw}{R6lAnctaQ5WARbVBBBoEeQ}{master-03}{7.32.137.103}{7.32.137.103:9300}{m}{8.13.2}{7000099-8503000}, {master-01}{uyTujKeRRjugDIieqT2PEA}{FOlws3_PRrCGqqE7gpFTTw}{master-01}{7.32.137.110}{7.32.137.110:9304}{m}{8.13.2}{7000099-8503000}] from last-known cluster state; node term 66, last-accepted version 5607670 in term 66; joining [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}] in term [66] has status [waiting for local cluster applier] after [1.1m/70024ms]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.13/discovery-troubleshooting.html

I checked the document Troubleshooting discovery | Elasticsearch Guide [8.13] | Elastic and learned that the only way to troubleshoot further is with jstack. I don't see any more useful information in the existing ES logs.
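For reference, a minimal sketch of how to capture such a thread dump while a node is stuck (the paths, the elasticsearch user, and the main-class name used for the PID lookup are assumptions for a typical package install; adjust them to your own layout):

# Find the server JVM's PID (adjust the pattern if your version/launcher differs,
# or read it from your pidfile instead).
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)

# jstack must run as the same user as the JVM; the JDK bundled with Elasticsearch
# should include it under the jdk/bin directory of the install.
sudo -u elasticsearch /usr/share/elasticsearch/jdk/bin/jstack "$ES_PID" > "/tmp/jstack-$(hostname)-$(date +%s).txt"

# The interesting thread is the cluster applier:
grep -A 40 'clusterApplierService#updateTask' /tmp/jstack-*.txt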

That's right, please share a jstack from a node when it's in this state. It's not possible to answer your questions without that.

I don't think the problem is caused by that PR. As mentioned in the PR description:

If the application takes long enough then
there may be time for the node to rejoin the cluster, time out applying
the state resulting from the rejoin, and be removed again. This can
happen repeatedly and is disruptive to the rest of the cluster.

In other words, the node would fail to join until the applier thread becomes free. So the question is why the applier thread is not free. That's why we need to see those stack dumps.

please see stackdump · GitHub

"elasticsearch[data-35-stale][clusterApplierService#updateTask][T#1]" #59 [460] daemon prio=5 os_prio=0 cpu=1796297.22ms elapsed=348952.88s tid=0x00007fa93bf86750 nid=460 runnable  [0x00007fa8dd4a1000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.UnixFileDispatcherImpl.pread0(java.base@20.0.1/Native Method)
	at sun.nio.ch.UnixFileDispatcherImpl.pread(java.base@20.0.1/UnixFileDispatcherImpl.java:57)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(java.base@20.0.1/IOUtil.java:338)
	at sun.nio.ch.IOUtil.read(java.base@20.0.1/IOUtil.java:306)
	at sun.nio.ch.IOUtil.read(java.base@20.0.1/IOUtil.java:283)
	at sun.nio.ch.FileChannelImpl.readInternal(java.base@20.0.1/FileChannelImpl.java:986)
	at sun.nio.ch.FileChannelImpl.read(java.base@20.0.1/FileChannelImpl.java:969)
	at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(org.apache.lucene.core@9.6.0/NIOFSDirectory.java:179)
	at org.apache.lucene.store.BufferedIndexInput.refill(org.apache.lucene.core@9.6.0/BufferedIndexInput.java:291)
	at org.apache.lucene.store.BufferedIndexInput.readByte(org.apache.lucene.core@9.6.0/BufferedIndexInput.java:55)
	at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(org.apache.lucene.core@9.6.0/BufferedChecksumIndexInput.java:39)
	at org.apache.lucene.codecs.CodecUtil.readBEInt(org.apache.lucene.core@9.6.0/CodecUtil.java:667)
	at org.apache.lucene.codecs.CodecUtil.checkHeader(org.apache.lucene.core@9.6.0/CodecUtil.java:184)
	at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(org.apache.lucene.core@9.6.0/CodecUtil.java:253)
	at org.apache.lucene.codecs.lucene90.Lucene90SegmentInfoFormat.read(org.apache.lucene.core@9.6.0/Lucene90SegmentInfoFormat.java:106)
	at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(org.apache.lucene.core@9.6.0/SegmentInfos.java:407)
	at org.apache.lucene.index.SegmentInfos.readCommit(org.apache.lucene.core@9.6.0/SegmentInfos.java:364)
	at org.apache.lucene.index.SegmentInfos.readCommit(org.apache.lucene.core@9.6.0/SegmentInfos.java:300)
	at org.apache.lucene.index.SegmentInfos.readCommit(org.apache.lucene.core@9.6.0/SegmentInfos.java:289)
	at org.elasticsearch.index.engine.CombinedDeletionPolicy.getDocCountOfCommit(org.elasticsearch.server@8.8.2/CombinedDeletionPolicy.java:180)
	at org.elasticsearch.index.engine.CombinedDeletionPolicy.onCommit(org.elasticsearch.server@8.8.2/CombinedDeletionPolicy.java:102)
	at org.apache.lucene.index.IndexFileDeleter.checkpoint(org.apache.lucene.core@9.6.0/IndexFileDeleter.java:582)
	at org.apache.lucene.index.IndexWriter.finishCommit(org.apache.lucene.core@9.6.0/IndexWriter.java:4097)
	- locked <0x0000001045133848> (a org.apache.lucene.index.IndexWriter)
	at org.apache.lucene.index.IndexWriter.commitInternal(org.apache.lucene.core@9.6.0/IndexWriter.java:4051)
	- locked <0x000000104549a508> (a java.lang.Object)
	at org.apache.lucene.index.IndexWriter.commit(org.apache.lucene.core@9.6.0/IndexWriter.java:4005)
	at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(org.elasticsearch.server@8.8.2/InternalEngine.java:2725)
	at org.elasticsearch.index.engine.InternalEngine.flush(org.elasticsearch.server@8.8.2/InternalEngine.java:2068)
	at org.elasticsearch.index.engine.Engine.flush(org.elasticsearch.server@8.8.2/Engine.java:1061)
	at org.elasticsearch.index.engine.Engine.flushAndClose(org.elasticsearch.server@8.8.2/Engine.java:1763)
	at org.elasticsearch.index.shard.IndexShard.close(org.elasticsearch.server@8.8.2/IndexShard.java:1640)
	- locked <0x0000001045132380> (a java.lang.Object)
	at org.elasticsearch.index.IndexService.closeShard(org.elasticsearch.server@8.8.2/IndexService.java:579)
	at org.elasticsearch.index.IndexService.removeShard(org.elasticsearch.server@8.8.2/IndexService.java:559)
	- locked <0x000000104509ff80> (a org.elasticsearch.index.IndexService)
	at org.elasticsearch.index.IndexService.close(org.elasticsearch.server@8.8.2/IndexService.java:354)
	- locked <0x000000104509ff80> (a org.elasticsearch.index.IndexService)
	at org.elasticsearch.indices.IndicesService.removeIndex(org.elasticsearch.server@8.8.2/IndicesService.java:899)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.removeIndicesAndShards(org.elasticsearch.server@8.8.2/IndicesClusterStateService.java:408)

This indicates that ES is waiting on IO while closing a shard in order to move it to a different node. There is an open issue tracking this (#89821), but it's been a problem for a very long time; it's definitely not new in ES 8.

The stack trace indicates that you're writing to indices on cold nodes. The simplest solution is not to do that. Spinning disks are dreadfully slow at the best of times. Make sure that all your writeable indices are on nodes with better-performing disks.

My cold nodes hold only historical indices that have already rolled over, so this doesn't match my expectations. Can I find out which index is causing it? Or can I avoid this by setting the indices to block writes?

We close an index before deleting it, to avoid accidentally deleting the wrong index; that close step happens first, during a time window that I control.

When data migrates from a hot node to a cold node, or rebalances between cold nodes, does that count as an index write?

If you are not expecting older indices to be updated or to have documents deleted, I would recommend you set them to read-only as part of your ILM policy.
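For example, a minimal sketch of a policy with the readonly action in the cold phase (the policy name, the 7d timing, and the EUSER/EPASS/EHOST/EPORT variables are placeholders; in practice you would add the action to your existing policy rather than create a new one):

# Add a readonly action to the cold phase of an ILM policy (names/timings are examples).
curl -s -k -u "${EUSER}:${EPASS}" -X PUT \
  "https://${EHOST}:${EPORT}/_ilm/policy/example-readonly-policy" \
  -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "cold": {
        "min_age": "7d",
        "actions": {
          "readonly": {}
        }
      }
    }
  }
}'

The readonly action adds the index.blocks.write block when an index enters that phase, so searches keep working but indexing and deletes are rejected.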

Do you have any processes or tasks that use update-by-query or delete-by-query against these types of indices?

You mean find which index is being written to, when you don't expect it to be written to?

There are the stats in Stack Management for a start, but checking those one index at a time doesn't scale too well.

If new docs are being added (or deleted), just dump /_cat/indices, wait a bit, dump it again, and see which indices are now bigger or smaller (in #docs).

(One long line; you will likely want to adjust the sleep 10 to something bigger.)

sdiff -s <(sleep 10 && curl -s -k -u "${EUSER}":"${EPASS}"  "https://${EHOST}:${EPORT}/_cat/indices" | awk '{print $3,$7}' | sort)  <(curl -s -k -u "${EUSER}":"${EPASS}"  "https://${EHOST}:${EPORT}/_cat/indices" | awk '{print $3,$7}' | sort)

There are probably better ways, but a somewhat hacky approach would be to just look at the files, something like:

find /path/to/your/data -type f -exec stat -c "%y - %n" {} \; | sort -k 1,2

This might produce some misleading results, but you might be lucky.

I've set the ILM read-only action; checking the index settings, I can see it added {"index.blocks":{"write":"true"}}.
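As a cross-check, here is a rough one-liner to list any indices that still lack the write block (it assumes jq is installed and reuses the EUSER/EPASS/EHOST/EPORT variables from the sdiff one-liner above; hot indices that are still meant to be written to will of course show up as well):

# List indices without index.blocks.write=true (i.e. still writable).
curl -s -k -u "${EUSER}:${EPASS}" \
  "https://${EHOST}:${EPORT}/_all/_settings/index.blocks.write?flat_settings=true" \
  | jq -r 'to_entries[] | select(.value.settings["index.blocks.write"] != "true") | .key'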

Client applications only ever use the rollover alias; in other words, they do not update or write to the historical indices.

We take backup snapshots of historical indices, then close and delete them.

Thanks, I'll give it a try. It may take a while because of the Chinese New Year holiday.

Obviously, if you have now set all the suspect indices to read-only via ILM, then #docs should not change in those indices.

If no client, or anything else, were somehow causing IO activity on those same indices, then my understanding from this thread is that you wouldn't have had the problem in the first place.

I would expect you can by looking at GET _stats?level=shards, particularly .indices[].shards[][].indexing.index_total and .indices[].shards[][].indexing.delete_total. Unfortunately it'll take a bit of work because .indices[].shards[][].routing.node is the internal node ID and you'll need to look that up in GET _nodes/_all/none to work out which ones are cold nodes.
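If it helps, here is a rough jq sketch along those lines (it assumes jq, reuses the EUSER/EPASS/EHOST/EPORT variables from the earlier one-liner, and prints index, node name, index_total, and delete_total for every shard copy that has done any indexing; these counters are cumulative since the shard started on its current node, so you may want to compare two samples taken some time apart):

# Map internal node IDs to node names, then list shards with indexing/deletion activity.
NODES=$(curl -s -k -u "${EUSER}:${EPASS}" "https://${EHOST}:${EPORT}/_nodes/_all/none")
curl -s -k -u "${EUSER}:${EPASS}" "https://${EHOST}:${EPORT}/_stats?level=shards" \
  | jq --argjson nodes "$NODES" -r '
      .indices | to_entries[] | .key as $index
      | .value.shards | to_entries[] | .value[]
      | select(.indexing.index_total > 0 or .indexing.delete_total > 0)
      | [$index, ($nodes.nodes[.routing.node].name // .routing.node),
         .indexing.index_total, .indexing.delete_total]
      | @tsv'

You can then filter the node-name column down to your cold nodes.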

No, it's not

@zcola did you solve your mystery?

I will continue to investigate this issue when I'm back at work after the holidays. Thanks.