Hello,
My cluster keeps randomly dropping one node, which makes the stack unstable; sometimes it takes more than 3 hours for that node to rejoin and recover.
My cluster is composed of 4 nodes:
1 coordinator node.
3 master-eligible data nodes.
While troubleshooting, I found these warnings in the cluster log:
[cluster:monitor/nodes/stats[n]], node [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}{xpack.installed=true, transform.node=true}], id [557660636]
[2022-02-15T10:06:08,021][WARN ][o.e.t.TransportService ] [node3] Received response for a request that has timed out, sent [11m/660143ms] ago, timed out [10.7m/645126ms] ago, action [indices:monitor/stats[n]], node [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}{xpack.installed=true, transform.node=true}], id [557660638]
[2022-02-15T10:06:36,209][WARN ][o.e.g.PersistedClusterStateService] [node3] writing cluster state took [76801ms] which is above the warn threshold of [10s]; wrote global metadata [true] and metadata for [0] indices and skipped [751] unchanged indices
[2022-02-15T10:06:36,210][INFO ][o.e.c.s.ClusterApplierService] [node3] master node changed {previous [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}], current []}, term: 5072, version: 373355, reason: becoming candidate: Publication.onCompletion(false)
[2022-02-15T10:06:36,210][WARN ][o.e.c.s.MasterService ] [node3] failing [put-pipeline-rs-auditbeat-enrich]: failed to commit cluster state version [373356]
[2022-02-15T10:06:55,864][WARN ][o.e.t.ThreadPool ] [node3] timer thread slept for [19.5s/19591ms] on absolute clock which is above the warn threshold of [5000ms]
[2022-02-15T10:06:55,865][WARN ][o.e.t.ThreadPool ] [node3] timer thread slept for [19.5s/19590713912ns] on relative clock which is above the warn threshold of [5000ms]
[2022-02-15T10:06:55,883][WARN ][o.e.c.InternalClusterInfoService] [node3] failed to retrieve stats for node [vIppyaoJSnKBbdzpsmxQPQ]: [node3][192.168.22.249:9300][cluster:monitor/nodes/stats[n]] request_id [557663890] timed out after [26796ms]
[2022-02-15T10:06:55,884][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node3] master not discovered or elected yet, an election requires at least 2 nodes with ids from [vIppyaoJSnKBbdzpsmxQPQ, Cn7trT9HR5uu0Jbitp1SzQ, 56cVsSpASAGUEXn10efqfg], have only discovered non-quorum [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}]; discovery will continue using [192.168.22.247:9300, 192.168.22.248:9300] from hosts providers and [{node1}{56cVsSpASAGUEXn10efqfg}{1i2ieqWdRCGKO8G356mBwQ}{node1.net}{192.168.22.247:9300}{dim}, {node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}, {node2}{Cn7trT9HR5uu0Jbitp1SzQ}{xLefKg79SzyVidQqvbnaIg}{node2.net}{192.168.22.248:9300}{dim}] from last-known cluster state; node term 5072, last-accepted version 373356 in term 5072
[2022-02-15T10:06:55,885][WARN ][o.e.c.InternalClusterInfoService] [node3] failed to retrieve shard stats from node [vIppyaoJSnKBbdzpsmxQPQ]: [node3][192.168.22.249:9300][indices:monitor/stats[n]] request_id [557663893] timed out after [26796ms]
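To make the timings in the first warning easier to read, here is a small throwaway script (my own parsing of the quoted line, nothing official) that pulls the millisecond values out of the bracketed pairs and computes the effective request timeout:

```python
import re

# The relevant fragment of the TransportService warning quoted above.
line = ("Received response for a request that has timed out, "
        "sent [11m/660143ms] ago, timed out [10.7m/645126ms] ago")

# Each bracketed pair looks like [11m/660143ms]; grab the ms values.
sent_ms, timed_out_ms = (int(m) for m in re.findall(r"/(\d+)ms\]", line))

# The request was given (sent - timed_out) ms before the transport layer
# gave up on it, i.e. its configured timeout.
effective_timeout_ms = sent_ms - timed_out_ms
print(effective_timeout_ms)  # 15017
```

So the stats request had roughly a 15-second timeout, but the response only arrived about 11 minutes after it was sent, which matches the long "timer thread slept" and "writing cluster state took [76801ms]" warnings: node3 appears to be stalling (GC pause, VM pause, or very slow disk) long enough to lose the master and fail quorum.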