Hi,
This error obviously indicates cluster overload, but it has become more frequent since we added more data nodes to the cluster and separated out master-only nodes. The throughput of the cluster has also dropped.
There are hardly any search requests on this cluster at the moment.
Here are the configuration details.
ES version - 7.5.2
Data Nodes - 75+
Master Only Nodes - 3
Data/Ingest Nodes - 3
Total Nodes - 3
Data Details.
Logstash - 25+ Instances
Index Count - 1
Shards - 80
Replica - 1 (1 primary + 1 replica)
ILM is used to roll over the index after 1TB.
There are about 12-14 rollovers a day.
So 12-14 indices of 1TB each are created daily.
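As a rough sanity check on the figures above, here is a minimal sketch of the shard arithmetic. It assumes the 1TB rollover threshold applies to total primary storage and that 1TB is counted as 1000GB. The function names are illustrative only and not part of any Elasticsearch API.

```python
# Figures come from the post; how the 1TB threshold is interpreted is an assumption.
PRIMARY_SHARDS = 80
REPLICAS = 1
ROLLOVER_SIZE_GB = 1000          # assumes 1TB == 1000GB of primary storage
ROLLOVERS_PER_DAY = (12, 14)     # low and high ends of the stated range


def shards_per_index(primaries: int, replicas: int) -> int:
    """Total shard copies (primaries plus replicas) in one index."""
    return primaries * (1 + replicas)


def primary_shard_size_gb(rollover_gb: float, primaries: int) -> float:
    """Approximate size of each primary shard at rollover time."""
    return rollover_gb / primaries


def new_shards_per_day(rollovers: int, primaries: int, replicas: int) -> int:
    """Shard copies added to the cluster state per day by rollovers."""
    return rollovers * shards_per_index(primaries, replicas)


if __name__ == "__main__":
    print("shard copies per index:", shards_per_index(PRIMARY_SHARDS, REPLICAS))
    print("GB per primary shard:", primary_shard_size_gb(ROLLOVER_SIZE_GB, PRIMARY_SHARDS))
    for n in ROLLOVERS_PER_DAY:
        print(n, "rollovers/day ->", new_shards_per_day(n, PRIMARY_SHARDS, REPLICAS), "new shard copies")
```

Under these assumptions each index carries 160 shard copies of roughly 12.5GB per primary, and the cluster gains somewhere between 1920 and 2240 shard copies per day.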
There is no firewall and no noticeable network latency between the nodes of the cluster. They are in the same DC and physically almost side by side.
The master node's log is full of the errors below.
Received response for a request that has timed out, sent [11608ms] ago, timed out [1603ms] ago, action [internal:coordination/fault_detection/follower_check], node ............................
[o.e.c.c.C.CoordinatorPublication] [Master Node Name] after [30s] publication of cluster state version [number] is still waiting for {Data Node Name}..........................., xpack.installed=true} [SENT_APPLY_COMMIT], {Data Node name}{............... xpack.installed=true} [SENT_APPLY_COMMIT]
As a result of the above errors, nodes are sometimes removed and added back when 3 consecutive follower or leader checks fail.
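To read the timing in the first log line, the two durations can be subtracted to see how long the request waited before it timed out. The sketch below is only an illustration: the regular expression is written against the message format shown above, and `parse_timed_out_response` is a hypothetical helper, not an Elasticsearch tool.

```python
import re

# Pattern written against the log message quoted in this post (an assumption
# about the format, not an official specification).
TIMED_OUT_RE = re.compile(
    r"sent \[(\d+)ms\] ago, timed out \[(\d+)ms\] ago, action \[([^\]]+)\]"
)

SAMPLE_LINE = (
    "Received response for a request that has timed out, sent [11608ms] ago, "
    "timed out [1603ms] ago, action "
    "[internal:coordination/fault_detection/follower_check], node ..."
)


def parse_timed_out_response(line: str):
    """Return (sent_ms, timed_out_ms, action), or None if the line does not match."""
    match = TIMED_OUT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


def wait_before_timeout_ms(sent_ms: int, timed_out_ms: int) -> int:
    """Time between sending the request and the timeout firing."""
    return sent_ms - timed_out_ms


if __name__ == "__main__":
    sent_ms, timed_out_ms, action = parse_timed_out_response(SAMPLE_LINE)
    print(action)
    print("waited before timeout (ms):", wait_before_timeout_ms(sent_ms, timed_out_ms))
    print("response arrived late by (ms):", timed_out_ms)
```

For the line above this gives a wait of about 10 seconds before the timeout, with the response arriving roughly 1.6 seconds after that. This looks consistent with a 10s follower check timeout, which I believe is the default for `cluster.fault_detection.follower_check.timeout`, though that is worth confirming against the 7.5 documentation.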
Please let me know if any other information is needed for the above issue.
Thanks,
Ankit.