My cluster keeps dropping a node and changing master: "failed to write cluster state"

Hello,
My cluster keeps randomly dropping one node, which makes the stack unstable; sometimes it takes more than 3 hours for that node to rejoin and recover.
My cluster consists of 4 nodes:
1 coordinating-only node.
3 data nodes, which are master-eligible.

While troubleshooting, I found these warnings in the cluster log:

 [cluster:monitor/nodes/stats[n]], node [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}{xpack.installed=true, transform.node=true}], id [557660636]
[2022-02-15T10:06:08,021][WARN ][o.e.t.TransportService   ] [node3] Received response for a request that has timed out, sent [11m/660143ms] ago, timed out [10.7m/645126ms] ago, action [indices:monitor/stats[n]], node [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}{xpack.installed=true, transform.node=true}], id [557660638]
[2022-02-15T10:06:36,209][WARN ][o.e.g.PersistedClusterStateService] [node3] writing cluster state took [76801ms] which is above the warn threshold of [10s]; wrote global metadata [true] and metadata for [0] indices and skipped [751] unchanged indices
[2022-02-15T10:06:36,210][INFO ][o.e.c.s.ClusterApplierService] [node3] master node changed {previous [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}], current []}, term: 5072, version: 373355, reason: becoming candidate: Publication.onCompletion(false)
[2022-02-15T10:06:36,210][WARN ][o.e.c.s.MasterService    ] [node3] failing [put-pipeline-rs-auditbeat-enrich]: failed to commit cluster state version [373356]
[2022-02-15T10:06:55,864][WARN ][o.e.t.ThreadPool         ] [node3] timer thread slept for [19.5s/19591ms] on absolute clock which is above the warn threshold of [5000ms]
[2022-02-15T10:06:55,865][WARN ][o.e.t.ThreadPool         ] [node3] timer thread slept for [19.5s/19590713912ns] on relative clock which is above the warn threshold of [5000ms]
[2022-02-15T10:06:55,883][WARN ][o.e.c.InternalClusterInfoService] [node3] failed to retrieve stats for node [vIppyaoJSnKBbdzpsmxQPQ]: [node3][192.168.22.249:9300][cluster:monitor/nodes/stats[n]] request_id [557663890] timed out after [26796ms]
[2022-02-15T10:06:55,884][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node3] master not discovered or elected yet, an election requires at least 2 nodes with ids from [vIppyaoJSnKBbdzpsmxQPQ, Cn7trT9HR5uu0Jbitp1SzQ, 56cVsSpASAGUEXn10efqfg], have only discovered non-quorum [{node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}]; discovery will continue using [192.168.22.247:9300, 192.168.22.248:9300] from hosts providers and [{node1}{56cVsSpASAGUEXn10efqfg}{1i2ieqWdRCGKO8G356mBwQ}{node1.net}{192.168.22.247:9300}{dim}, {node3}{vIppyaoJSnKBbdzpsmxQPQ}{mZ_t280vQgmqv1G8ogeRKg}{node3.net}{192.168.22.249:9300}{dimt}, {node2}{Cn7trT9HR5uu0Jbitp1SzQ}{xLefKg79SzyVidQqvbnaIg}{node2.net}{192.168.22.248:9300}{dim}] from last-known cluster state; node term 5072, last-accepted version 373356 in term 5072
[2022-02-15T10:06:55,885][WARN ][o.e.c.InternalClusterInfoService] [node3] failed to retrieve shard stats from node [vIppyaoJSnKBbdzpsmxQPQ]: [node3][192.168.22.249:9300][indices:monitor/stats[n]] request_id [557663893] timed out after [26796ms]
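The warning I keep coming back to is the PersistedClusterStateService one. A quick sketch I used to pull the slow cluster-state write durations out of the log (assuming the log format shown above; the function name is mine):

```python
import re

# Matches the duration in warnings like:
#   "writing cluster state took [76801ms] which is above the warn threshold"
WRITE_WARNING = re.compile(r"writing cluster state took \[(\d+)ms\]")

def slow_state_writes(log_lines):
    """Return the duration (in ms) of every slow cluster-state write warning."""
    durations = []
    for line in log_lines:
        m = WRITE_WARNING.search(line)
        if m:
            durations.append(int(m.group(1)))
    return durations

sample = [
    "[2022-02-15T10:06:36,209][WARN ][o.e.g.PersistedClusterStateService] "
    "[node3] writing cluster state took [76801ms] which is above the warn "
    "threshold of [10s]; wrote global metadata [true] and metadata for [0] "
    "indices and skipped [751] unchanged indices",
]
print(slow_state_writes(sample))  # → [76801]
```

Running it over a day of logs shows how often the writes blow past the 10s warn threshold.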

I used the GET /_cluster/state API to download my cluster state; the file size is 149 MB. Is that normal?
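To see where those 149 MB come from, I broke the saved state down by top-level section with a small helper (my own script, not an official tool):

```python
import json

def section_sizes(state):
    """Given a parsed cluster-state dict, return each top-level section
    with its serialized size in bytes, largest first."""
    sizes = {key: len(json.dumps(value)) for key, value in state.items()}
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)

# Usage against a downloaded state file:
#   with open("cluster_state.json") as f:
#       state = json.load(f)
#   for name, size in section_sizes(state):
#       print(f"{name}: {size / 1024 / 1024:.1f} MB")
# The metadata section (index mappings/settings) is usually the biggest.
```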

As the error states, it looks like there are issues writing the cluster state. What type of storage are you using? What load is the cluster under? Which version of Elasticsearch are you using?

Here is the cluster load, read from the Stack Monitoring / Nodes page.

What do disk I/O and iowait look like? Are you running an indexing-heavy workload?
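On Linux, iowait can be derived from two `/proc/stat` CPU samples; a minimal sketch (field order per `proc(5)`; the counter snapshots below are made-up illustration values):

```python
def iowait_percent(prev, curr):
    """Percentage of time spent in iowait between two /proc/stat 'cpu'
    samples. Each sample is the list of counters after the 'cpu' label:
    user, nice, system, idle, iowait, irq, softirq, ...
    """
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0

# Made-up counter snapshots a few seconds apart:
prev = [100, 0, 50, 800, 30, 0, 0]
curr = [150, 0, 70, 900, 60, 0, 0]
print(round(iowait_percent(prev, curr), 1))  # → 15.0
```

Tools like `iostat -x` or `vmstat` report the same figure without scripting; sustained double-digit iowait usually points at storage being the bottleneck.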

iowait is between 2.4 and 3, and sometimes peaks to over 13.
As for the indexing-heavy workload: I didn't tune Elasticsearch for indexing, but I did tune my Filebeat agents for better indexing throughput, using a bulk size of 1024 and multiple workers.
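The Filebeat output settings I changed look roughly like this (the hosts value here is a placeholder, not my real endpoint):

```yaml
output.elasticsearch:
  hosts: ["https://node1.net:9200"]   # placeholder endpoint
  bulk_max_size: 1024                 # events per bulk request
  worker: 4                           # concurrent workers per host (default is 1)
```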

After viewing the data in Stack Monitoring, I see that the affected node reached a load of 39.2 yesterday before it was dropped from the cluster.

Sounds like your cluster is overloaded, possibly due to using a slow disk. I would recommend increasing the size of the cluster or upgrading to SSDs.

@Christian_Dahlqvist Thank you very much. I think I need to add some nodes.

I don't know if there is a specific answer for this, but what is the Elasticsearch throughput limit per node?

That depends on the hardware used, the data, and sometimes even the method of ingestion.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.