We see this problem on versions 8.8.2 and 8.13.2. The cluster data size is usually 300-700 TB.
The log shows the node stuck in "waiting for local cluster applier", sometimes for one or two minutes and sometimes for more than ten minutes. When I restart the problem node, it rejoins the cluster almost immediately. This happens in several of our clusters, and it is always the HDD-backed cold nodes that are affected. At the time, the node may be taking snapshots or running high-cost searches, so I admit the IO pressure may be too great for the HDDs, but this problem never occurred before we upgraded to ES 8 (I saw the change introduced in Block joins while applier is busy (#84919) · elastic/elasticsearch@c88dd10 · GitHub). At worst, queries are slowed down by the IO bottleneck, which is acceptable. However, a node leaving the cluster and not rejoining for a long time causes incomplete query results.
What should I do, and how can I reduce the IO pressure, given that the IO capacity of HDDs is not high? For now, I just restart the node and it rejoins the cluster immediately.
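In case the snapshots are the main source of the IO pressure, one thing I am considering is lowering the snapshot throughput by re-registering the repository with max_snapshot_bytes_per_sec (a sketch only; the repository name and path below are hypothetical placeholders for ours):

# repository name and location are hypothetical placeholders
PUT _snapshot/backup_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/backup_repo",
    "max_snapshot_bytes_per_sec": "20mb"
  }
}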
So far the only thing I have actually adjusted is:

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}

It seems to be working; at least no node has dropped out of the cluster in the 12 hours since the adjustment.
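To confirm the value actually took effect on every index, I believe the get settings API can filter on the setting name, e.g. something like:

GET _all/_settings/index.unassigned.node_left.delayed_timeout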
Below is part of the log:
[3] consecutive checks of the master node [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}] were unsuccessful ([3] rejected, [0] timed out), restarting discovery; more details may be available in the master node logs [last unsuccessful check: rejecting check since [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}] has been removed from the cluster]
master node changed {previous [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}], current []}, term: 66, version: 5607662, reason: becoming candidate: onLeaderFailure
master node changed {previous [], current [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}]}, term: 66, version: 5607667, reason: ApplyCommitRequest{term=66, version=5607667, sourceNode={client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}{ml.config_version=12.0.0, xpack.installed=true, transform.config_version=10.0.0}}
node-join[{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000} joining, removed [33.4s/33412ms] ago with reason [followers check retry count exceeded [timeouts=3, failures=0]], [9] total removals], term: 66, version: 5607667, delta: added {{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}}
after [10s] publication of cluster state version [5607667] is still waiting for {data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}{ml.config_version=12.0.0, zone=stale, transform.config_version=10.0.0, xpack.installed=true} [SENT_APPLY_COMMIT]
added {{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}}, term: 66, version: 5607667, reason: Publication{term=66, version=5607667}
node-join: [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}] with reason [joining, removed [33.4s/33412ms] ago with reason [followers check retry count exceeded [timeouts=3, failures=0]], [9] total removals]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.13/troubleshooting-unstable-cluster.html
node [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}{ml.config_version=12.0.0, zone=stale, transform.config_version=10.0.0, xpack.installed=true}] is lagging at cluster state version [0], although publication of cluster state version [5607667] completed [1.5m] ago
master not discovered yet: have discovered [{data-62-stale}{Lq8UAE-xSv22hzQSNgR2hA}{qYQaQ1O5QKK7PNqOtIpeCw}{data-62-stale}{10.90.142.67}{10.90.142.67:9300}{di}{8.13.2}{7000099-8503000}, {client-01}{OUwpT0qyTJOuRz7k4HxlKQ}{wr_8PPMTQbWSL7FZxfJ-pA}{client-01}{7.32.146.96}{7.32.146.96:9300}{mr}{8.13.2}{7000099-8503000}, {master-02}{Rh4InA0gQumqNu5w49DjHg}{C2YTWQgZQdG_ZRaT1Qkq1w}{master-02}{7.32.137.102}{7.32.137.102:9301}{m}{8.13.2}{7000099-8503000}, {master-01}{uyTujKeRRjugDIieqT2PEA}{FOlws3_PRrCGqqE7gpFTTw}{master-01}{7.32.137.110}{7.32.137.110:9304}{m}{8.13.2}{7000099-8503000}, {client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}, {master-03}{9PDHtkV0TPyz1kP_85v2Qw}{R6lAnctaQ5WARbVBBBoEeQ}{master-03}{7.32.137.103}{7.32.137.103:9300}{m}{8.13.2}{7000099-8503000}] who claim current master to be [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}]; discovery will continue using [7.32.146.96:9300, 7.32.146.116:9300] from hosts providers and [{client-01}{OUwpT0qyTJOuRz7k4HxlKQ}{wr_8PPMTQbWSL7FZxfJ-pA}{client-01}{7.32.146.96}{7.32.146.96:9300}{mr}{8.13.2}{7000099-8503000}, {master-02}{Rh4InA0gQumqNu5w49DjHg}{C2YTWQgZQdG_ZRaT1Qkq1w}{master-02}{7.32.137.102}{7.32.137.102:9301}{m}{8.13.2}{7000099-8503000}, {client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}, {master-03}{9PDHtkV0TPyz1kP_85v2Qw}{R6lAnctaQ5WARbVBBBoEeQ}{master-03}{7.32.137.103}{7.32.137.103:9300}{m}{8.13.2}{7000099-8503000}, {master-01}{uyTujKeRRjugDIieqT2PEA}{FOlws3_PRrCGqqE7gpFTTw}{master-01}{7.32.137.110}{7.32.137.110:9304}{m}{8.13.2}{7000099-8503000}] from last-known cluster state; node term 66, last-accepted version 5607670 in term 66; joining [{client-02}{JkeQS1n_TiqgkQk2W0NwsA}{5gL9M2UiRGuPlfobpLZFPg}{client-02}{7.32.146.116}{7.32.146.116:9300}{mr}{8.13.2}{7000099-8503000}] in term [66] has status [waiting for local cluster applier] after [1.1m/70024ms]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.13/discovery-troubleshooting.html
I checked the Troubleshooting discovery | Elasticsearch Guide [8.13] | Elastic documentation and learned that the only way to troubleshoot this further seems to be a jstack thread dump. I don't see any more useful information in the existing ES logs.
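Next time a node gets stuck in this state I plan to capture what the cluster applier thread is doing. Besides jstack, I assume the nodes hot threads API would show it as well (the node name is taken from the log above; the parameter values are just what I intend to try):

# show all threads, including idle ones, on the stuck data node
GET _nodes/data-62-stale/hot_threads?threads=9999&ignore_idle_threads=false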