My cluster keeps dropping a node and changing master: "failed to write cluster state"

Hello,
My cluster keeps randomly dropping one node, which makes the stack unstable; sometimes it takes more than 3 hours for that node to rejoin and recover.
My cluster consists of 4 nodes:
1 coordinating-only node.
3 data nodes, which are master-eligible.

While troubleshooting, I found these warnings in the cluster log:

 [cluster:monitor/nodes/stats[n]], node [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}{xpack.installed=true, transform.node=true}], id [557660636]
[2022-02-15T10:06:08,021][WARN ][o.e.t.TransportService   ] [node3] Received response for a request that has timed out, sent [11m/660143ms] ago, timed out [10.7m/645126ms] ago, action [indices:monitor/stats[n]], node [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}{xpack.installed=true, transform.node=true}], id [557660638]
[2022-02-15T10:06:36,209][WARN ][o.e.g.PersistedClusterStateService] [node3] writing cluster state took [76801ms] which is above the warn threshold of [10s]; wrote global metadata [true] and metadata for [0] indices and skipped [751] unchanged indices
[2022-02-15T10:06:36,210][INFO ][o.e.c.s.ClusterApplierService] [node3] master node changed {previous [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}], current []}, term: 5072, version: 373355, reason: becoming candidate: Publication.onCompletion(false)
[2022-02-15T10:06:36,210][WARN ][o.e.c.s.MasterService    ] [node3] failing [put-pipeline-rs-auditbeat-enrich]: failed to commit cluster state version [373356]
[2022-02-15T10:06:55,864][WARN ][o.e.t.ThreadPool         ] [node3] timer thread slept for [19.5s/19591ms] on absolute clock which is above the warn threshold of [5000ms]
[2022-02-15T10:06:55,865][WARN ][o.e.t.ThreadPool         ] [node3] timer thread slept for [19.5s/19590713912ns] on relative clock which is above the warn threshold of [5000ms]
[2022-02-15T10:06:55,883][WARN ][o.e.c.InternalClusterInfoService] [node3] failed to retrieve stats for node [vIppyaoJSnKBbdzpsmxQPQ]: [node3][192.168.22.249:9300][cluster:monitor/nodes/stats[n]] request_id [557663890] timed out after [26796ms]
[2022-02-15T10:06:55,884][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node3] master not discovered or elected yet, an election requires at least 2 nodes with ids from [vIppyaoJSnKBbdzpsmxQPQ, Cn7trT9HR5uu0Jbitp1SzQ, 56cVsSpASAGUEXn10efqfg], have only discovered non-quorum [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}]; discovery will continue using [192.168.22.247:9300, 192.168.22.248:9300] from hosts providers and [{node1}{56cVsSpASAGUEXn10efqfg}{1i2ieqWdRCGKO8G356mBwQ}{node1.net}{192.168.22.247:9300}{dim}, {node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}, {node2}{Cn7trT9HR5uu0Jbitp1SzQ}{xLefKg79SzyVidQqvbnaIg}{node2.net}{192.168.22.248:9300}{dim}] from last-known cluster state; node term 5072, last-accepted version 373356 in term 5072
[2022-02-15T10:06:55,885][WARN ][o.e.c.InternalClusterInfoService] [node3] failed to retrieve shard stats from node [vIppyaoJSnKBbdzpsmxQPQ]: [node3][192.168.22.249:9300][indices:monitor/stats[n]] request_id [557663893] timed out after [26796ms]
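The warning I keep coming back to is the PersistedClusterStateService one. A quick sketch I used to pull the slow cluster-state write durations out of the log (assuming the log format shown above; the function name is mine):

```python
import re

# Matches the duration in warnings like:
#   "writing cluster state took [76801ms] which is above the warn threshold"
WRITE_WARNING = re.compile(r"writing cluster state took \[(\d+)ms\]")

def slow_state_writes(log_lines):
    """Return the duration (in ms) of every slow cluster-state write warning."""
    durations = []
    for line in log_lines:
        m = WRITE_WARNING.search(line)
        if m:
            durations.append(int(m.group(1)))
    return durations

sample = [
    "[2022-02-15T10:06:36,209][WARN ][o.e.g.PersistedClusterStateService] "
    "[node3] writing cluster state took [76801ms] which is above the warn "
    "threshold of [10s]; wrote global metadata [true] and metadata for [0] "
    "indices and skipped [751] unchanged indices",
]
print(slow_state_writes(sample))  # → [76801]
```

Running it over a day of logs shows how often the writes blow past the 10s warn threshold.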

I used the GET /_cluster/state API to download my cluster state; the file size is 149 MB. Is that normal?
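To see where those 149 MB come from, I broke the saved state down by top-level section with a small helper (my own script, not an official tool):

```python
import json

def section_sizes(state):
    """Given a parsed cluster-state dict, return each top-level section
    with its serialized size in bytes, largest first."""
    sizes = {key: len(json.dumps(value)) for key, value in state.items()}
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)

# Usage against a downloaded state file:
#   with open("cluster_state.json") as f:
#       state = json.load(f)
#   for name, size in section_sizes(state):
#       print(f"{name}: {size / 1024 / 1024:.1f} MB")
# The metadata section (index mappings/settings) is usually the biggest.
```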

As the error states, it looks like there are issues writing the cluster state. What type of storage are you using? What load is the cluster under? Which version of Elasticsearch are you using?

Here is the cluster load, read from the Stack Monitoring / Nodes page.

What do disk I/O and iowait look like? Are you running an indexing-heavy workload?
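On Linux, iowait can be derived from two `/proc/stat` CPU samples; a minimal sketch (field order per `proc(5)`; the counter snapshots below are made-up illustration values):

```python
def iowait_percent(prev, curr):
    """Percentage of time spent in iowait between two /proc/stat 'cpu'
    samples. Each sample is the list of counters after the 'cpu' label:
    user, nice, system, idle, iowait, irq, softirq, ...
    """
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0

# Made-up counter snapshots a few seconds apart:
prev = [100, 0, 50, 800, 30, 0, 0]
curr = [150, 0, 70, 900, 60, 0, 0]
print(round(iowait_percent(prev, curr), 1))  # → 15.0
```

Tools like `iostat -x` or `vmstat` report the same figure without scripting; sustained double-digit iowait usually points at storage being the bottleneck.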

iowait is between 2.4 and 3, and sometimes peaks to over 13.
As for the indexing-heavy workload: I didn't tune Elasticsearch for indexing, but I did tune my Filebeat agents for better indexing throughput, using a bulk size of 1024 and multiple workers.
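The Filebeat output settings I changed look roughly like this (the hosts value here is a placeholder, not my real endpoint):

```yaml
output.elasticsearch:
  hosts: ["https://node1.net:9200"]   # placeholder endpoint
  bulk_max_size: 1024                 # events per bulk request
  worker: 4                           # concurrent workers per host (default is 1)
```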

After viewing the data in Stack Monitoring, I see that the affected node reached a load of 39.2 yesterday before it was dropped from the cluster.

Sounds like your cluster is overloaded, possibly due to using a slow disk. I would recommend increasing the size of the cluster or upgrading to SSDs.

@Christian_Dahlqvist Thank you very much. I think I need to add some nodes.

I don't know if there is a specific answer for this, but what is the Elasticsearch throughput limit per node?

That depends on the hardware used, the data, and sometimes even the method of ingestion.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.