Since upgrading to 8.17.2, and still now on 8.18.0, snapshots to S3 are causing nodes to drop out of the cluster with the 'followers check retry count exceeded' error. So far it seems to happen only on the cold data nodes, but I haven't managed a completely successful snapshot since the issue started, so the hot/warm nodes may be affected as well. Also, when the dropped node rejoins the cluster, a lot of shard rebalancing kicks off. None of this happened prior to 8.17.2.
The cold nodes are all physical servers with 8 CPUs, at least 32 GB of memory, and SSD storage. The coordinator nodes are virtual machines with 8 vCPUs and 16 GB of memory. The cold nodes hold between 800 and 850 indices, with 1 primary and 1 replica per index. I've ruled out a hardware problem because the node that drops from the cluster is not always the same one.
Right around the time the nodes drop from the cluster, garbage collection and timer thread warnings show up. From the data node:
[2025-05-07T10:30:24,327][INFO ][o.e.m.j.JvmGcMonitorService] [data1] [gc][80668] overhead, spent [324ms] collecting in the last [1s]
[2025-05-07T10:32:08,293][INFO ][o.e.m.j.JvmGcMonitorService] [data1] [gc][80771] overhead, spent [490ms] collecting in the last [1s]
[2025-05-07T10:33:01,460][WARN ][o.e.m.j.JvmGcMonitorService] [data1] [gc][G1 Concurrent GC][80789][167] duration [35.7s], collections [1]/[35.9s], total [35.7s]/[9.5m], memory [6.8gb]->[6.8gb]/[19.4gb], all_pools {[CodeHeap 'non-nmethods'] [2.6mb]->[2.6mb]/[5.5mb]}{[Metaspace] [197.9mb]->[197.9mb]/[0b]}{[CodeHeap 'profiled nmethods'] [18mb]->[17.9mb]/[117.2mb]}{[Compressed Class Space] [24mb]->[24mb]/[1gb]}{[young] [1.3gb]->[1.3gb]/[0b]}{[old] [5.4gb]->[5.4gb]/[19.4gb]}{[survivor] [43.6mb]->[43.6mb]/[0b]}{[CodeHeap 'non-profiled nmethods'] [33.3mb]->[33.3mb]/[117.2mb]}
[2025-05-07T10:33:01,461][WARN ][o.e.m.j.JvmGcMonitorService] [data1] [gc][80789] overhead, spent [35.7s] collecting in the last [35.9s]
[2025-05-07T10:33:01,644][WARN ][o.e.t.ThreadPool ] [data1] timer thread slept for [35.9s/35911ms] on absolute clock which is above the warn threshold of [5000ms]
[2025-05-07T10:33:01,660][WARN ][o.e.t.ThreadPool ] [data1] timer thread slept for [35.9s/35911037136ns] on relative clock which is above the warn threshold of [5000ms]
[2025-05-07T10:33:04,678][INFO ][o.e.c.c.Coordinator ] [data1] [3] consecutive checks of the master node [{coord3}{j1EVgFV6Seieg20FZWiGDQ}{xTohzQf3R6O4yW5oGI9Pjw}{coord3}{192.168.1.26}{192.168.1.26:9300}{mt}{8.18.0}{7000099-8525000}] were unsuccessful ([3] rejected, [0] timed out), restarting discovery; more details may be available in the master node logs [last unsuccessful check: rejecting check since [{data1}{RgMem9IQSeGRgi-Xu40SwA}{ePJgynj6SZ6f3s8kiMcW-Q}{data1}{192.168.1.40}{192.168.1.40:9300}{cfs}{8.18.0}{7000099-8525000}] has been removed from the cluster]
From the master node at the same time:
[2025-05-07T10:32:57,845][INFO ][o.e.c.c.C.CoordinatorPublication] [coord3] after [10s] publication of cluster state version [5539940] is still waiting for {data1}{RgMem9IQSeGRgi-Xu40SwA}{ePJgynj6SZ6f3s8kiMcW-Q}{data1}{192.168.1.40}{192.168.1.40:9300}{cfs}{8.18.0}{7000099-8525000}{xpack.installed=true, ml.config_version=12.0.0, transform.config_version=10.0.0} [SENT_PUBLISH_REQUEST]
[2025-05-07T10:33:01,461][WARN ][o.e.t.TransportService ] [coord3] Received response for a request that has timed out, sent [35s/35020ms] ago, timed out [25s/25015ms] ago, action [internal:coordination/fault_detection/follower_check], node [{data1}{RgMem9IQSeGRgi-Xu40SwA}{ePJgynj6SZ6f3s8kiMcW-Q}{data1}{192.168.1.40}{192.168.1.40:9300}{cfs}{8.18.0}{7000099-8525000}{xpack.installed=true, ml.config_version=12.0.0, transform.config_version=10.0.0}], id [7804453]
[2025-05-07T10:33:01,461][WARN ][o.e.t.TransportService ] [coord3] Received response for a request that has timed out, sent [24s/24014ms] ago, timed out [14s/14009ms] ago, action [internal:coordination/fault_detection/follower_check], node [{data1}{RgMem9IQSeGRgi-Xu40SwA}{ePJgynj6SZ6f3s8kiMcW-Q}{data1}{192.168.1.40}{192.168.1.40:9300}{cfs}{8.18.0}{7000099-8525000}{xpack.installed=true, ml.config_version=12.0.0, transform.config_version=10.0.0}], id [7804708]
[2025-05-07T10:33:01,462][WARN ][o.e.t.TransportService ] [coord3] Received response for a request that has timed out, sent [13s/13009ms] ago, timed out [3s/3002ms] ago, action [internal:coordination/fault_detection/follower_check], node [{data1}{RgMem9IQSeGRgi-Xu40SwA}{ePJgynj6SZ6f3s8kiMcW-Q}{data1}{192.168.1.40}{192.168.1.40:9300}{cfs}{8.18.0}{7000099-8525000}{xpack.installed=true, ml.config_version=12.0.0, transform.config_version=10.0.0}], id [7805045]
[2025-05-07T10:33:02,358][INFO ][o.e.c.r.a.AllocationService] [coord3] current.health="YELLOW" message="Cluster health status changed from [GREEN] to [YELLOW] (reason: [{data1}{RgMem9IQSeGRgi-Xu40SwA}{ePJgynj6SZ6f3s8kiMcW-Q}{data1}{192.168.1.40}{192.168.1.40:9300}{cfs}{8.18.0}{7000099-8525000} reason: followers check retry count exceeded [timeouts=3, failures=0]])." previous.health="GREEN" reason="{data1}{RgMem9IQSeGRgi-Xu40SwA}{ePJgynj6SZ6f3s8kiMcW-Q}{data1}{192.168.1.40}{192.168.1.40:9300}{cfs}{8.18.0}{7000099-8525000} reason: followers check retry count exceeded [timeouts=3, failures=0]"
Elasticsearch is installed from the RPM packages (RHEL 9.5), and the TCP retransmission count has been set to 5 per the guidance at https://www.elastic.co/docs/deploy-manage/deploy/self-managed/system-config-tcpretries. I've also tried reducing the number of connections to S3 by setting s3.client.default.max_connections: 40 in the elasticsearch.yml on each node.
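For reference, this is roughly how those two changes were applied (the sysctl file name is just the one I picked; any file under /etc/sysctl.d/ would do, and the max_connections value is the one mentioned above):

# /etc/sysctl.d/99-elasticsearch.conf — TCP retransmission count per the linked guidance
net.ipv4.tcp_retries2 = 5

# /etc/elasticsearch/elasticsearch.yml on each node — limit S3 repository client connections
s3.client.default.max_connections: 40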
At this point I'm out of ideas on what to try next to remedy the situation.