Hello there,
I'd like to ask about a problem with my cluster. I hadn't made any changes and was just checking on the cluster when I found that one of my data nodes had suddenly gone offline and the cluster state had changed from green to yellow. I found many log entries like this in the log file of the offline data node:
[2022-09-08T07:19:54,303][WARN ][o.e.i.r.RecoverySettings ] [data-17] Unable to acquire permit to use snapshot files during recovery, this recovery will recover index files from the source node. Ensure snapshot files can be used during recovery by setting [indices.recovery.max_concurrent_snapshot_file_downloads] to be no greater than [25]
I also found these entries:
-
[2022-09-08T07:09:54,161][WARN ][o.e.g.PersistedClusterStateService] [data-17] writing cluster state took [279543ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [1200] unchanged indices
-
[2022-09-08T07:10:59,305][WARN ][o.e.m.f.FsHealthService ] [data-17] health check of [/elasticsearch/elasticsearch-7.17.0/nodes/0] took [82137ms] which is above the warn threshold of [5s]
-
[2022-09-08T07:11:54,412][INFO ][o.e.c.c.Coordinator ] [data-17] [3] consecutive checks of the master node [{master-1}{_MS9jkxdTv2wCscd0gFmyw}{ogzG8SjVQ0iz65K0VbSPvA}{10.37.187.31}{10.37.187.31:9300}{imrt}] were unsuccessful ([3] rejected, [0] timed out), restarting discovery; more details may be available in the master node logs [last unsuccessful check: rejecting check since [{data-17}{6rXNZDZgRiiu3TPBD1jncQ}{YTHMD4i4S3eYkjaTzPDRdg}{10.37.187.50}{10.37.187.50:9300}{dilrt}] has been removed from the cluster]
When I looked at the log of one of my master nodes, I found these entries related to the offline data node:
[2022-09-08T07:05:13,915][WARN ][o.e.c.InternalClusterInfoService] [master-1] failed to retrieve stats for node [6rXNZDZgRiiu3TPBD1jncQ]: [data-17][10.37.187.50:9300][cluster:monitor/nodes/stats[n]] request_id [109962722] timed out after [15008ms]
[2022-09-08T07:05:13,927][WARN ][o.e.c.InternalClusterInfoService] [master-1] failed to retrieve shard stats from node [6rXNZDZgRiiu3TPBD1jncQ]: [data-17][10.37.187.50:9300][indices:monitor/stats[n]] request_id [109962729] timed out after [15008ms]
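To make the timing easier to follow, the "took [...] which is above the warn threshold" warnings could be pulled out of the data node's log with a small script. This is only a rough Python sketch: the regex and the function name `slow_operations` are my own and assume the log line format shown above; they are not part of Elasticsearch.

```python
import re

# Assumed line shape (taken from the log excerpts in this post):
# [timestamp][LEVEL ][logger ] [node] ... took [123ms] which is above the warn threshold of [5s]
SLOW_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]"            # timestamp
    r"\[(?P<level>\w+)\s*\]"          # log level, possibly space-padded
    r"\[(?P<logger>[^\]]+?)\s*\]\s*"  # logger name, possibly space-padded
    r"\[(?P<node>[^\]]+)\]"           # node name
    r".*?took \[(?P<ms>\d+)ms\] which is above the warn threshold of "
    r"\[(?P<threshold>[^\]]+)\]"
)


def slow_operations(lines):
    """Yield (timestamp, node, logger, duration_ms, threshold) per slow warning."""
    for line in lines:
        m = SLOW_RE.match(line)
        if m:
            yield m["ts"], m["node"], m["logger"], int(m["ms"]), m["threshold"]


# Two of the entries quoted above, used as sample input.
sample = [
    "[2022-09-08T07:09:54,161][WARN ][o.e.g.PersistedClusterStateService] [data-17] "
    "writing cluster state took [279543ms] which is above the warn threshold of [10s]; "
    "wrote global metadata [false] and metadata for [1] indices and skipped [1200] unchanged indices",
    "[2022-09-08T07:10:59,305][WARN ][o.e.m.f.FsHealthService ] [data-17] "
    "health check of [/elasticsearch/elasticsearch-7.17.0/nodes/0] took [82137ms] "
    "which is above the warn threshold of [5s]",
]

for ts, node, logger, ms, threshold in slow_operations(sample):
    print(f"{ts} {node} {logger}: {ms / 1000:.1f}s (threshold {threshold})")
```

For a real log file, the same function could be called on the open file object instead of `sample`, since it only needs an iterable of lines.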
Do you think this is caused by a network issue? Or could it be caused by something else, such as the node being overloaded?
Any response would be very helpful. Thanks.