Unable to acquire permit to use snapshot files during recovery

Hello there,

I want to ask about something. I haven't made any changes; I was just checking my cluster when suddenly one of my data nodes went offline and my cluster state went from green to yellow. I found many log entries like this in the offline data node's log file:

[2022-09-08T07:19:54,303][WARN ][o.e.i.r.RecoverySettings ] [data-17] Unable to acquire permit to use snapshot files during recovery, this recovery will recover index files from the source node. Ensure snapshot files can be used during recovery by setting [indices.recovery.max_concurrent_snapshot_file_downloads] to be no greater than [25]

and I also found these:

  • [2022-09-08T07:09:54,161][WARN ][o.e.g.PersistedClusterStateService] [data-17] writing cluster state took [279543ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [1200] unchanged indices

  • [2022-09-08T07:10:59,305][WARN ][o.e.m.f.FsHealthService ] [data-17] health check of [/elasticsearch/elasticsearch-7.17.0/nodes/0] took [82137ms] which is above the warn threshold of [5s]

  • [2022-09-08T07:11:54,412][INFO ][o.e.c.c.Coordinator ] [data-17] [3] consecutive checks of the master node [{master-1}{_MS9jkxdTv2wCscd0gFmyw}{ogzG8SjVQ0iz65K0VbSPvA}{10.37.187.31}{10.37.187.31:9300}{imrt}] were unsuccessful ([3] rejected, [0] timed out), restarting discovery; more details may be available in the master node logs [last unsuccessful check: rejecting check since [{data-17}{6rXNZDZgRiiu3TPBD1jncQ}{YTHMD4i4S3eYkjaTzPDRdg}{10.37.187.50}{10.37.187.50:9300}{dilrt}] has been removed from the cluster]

When I took a look at one of my master nodes' logs, I found these entries related to the offline data node:

[2022-09-08T07:05:13,915][WARN ][o.e.c.InternalClusterInfoService] [master-1] failed to retrieve stats for node [6rXNZDZgRiiu3TPBD1jncQ]: [data-17][10.37.187.50:9300][cluster:monitor/nodes/stats[n]] request_id [109962722] timed out after [15008ms]

[2022-09-08T07:05:13,927][WARN ][o.e.c.InternalClusterInfoService] [master-1] failed to retrieve shard stats from node [6rXNZDZgRiiu3TPBD1jncQ]: [data-17][10.37.187.50:9300][indices:monitor/stats[n]] request_id [109962729] timed out after [15008ms]

Do you think this is caused by a network issue? Or could it be something else, such as the node being overloaded?

Your response will be very helpful. Thanks!

Are you able to please post a bit more of the logs? The context around these entries might be helpful.

I found this type of log:

> [2022-09-08T07:13:34,279][WARN ][o.e.a.b.TransportShardBulkAction] [data-17] [[metrics.ocp4-project.prod-esb-2022.09.07][0]] failed to perform indices:data/write/bulk[s] on replica [metrics.ocp4-project.prod-esb-2022.09.07][0], node[XLPmqmRLS8ePzj5cyeGoZQ], [R], s[STARTED], a[id=_vNoR2e2TCaSCPapp_qwrg]
> [2022-09-08T07:13:34,303][WARN ][o.e.a.b.TransportShardBulkAction] [data-17] [[metrics.ocp4-project.prod-esb-2022.09.07][0]] failed to perform indices:data/write/bulk[s] on replica [metrics.ocp4-project.prod-esb-2022.09.07][0], node[XLPmqmRLS8ePzj5cyeGoZQ], [R], s[STARTED], a[id=_vNoR2e2TCaSCPapp_qwrg]

and some entries like this:

[2022-09-08T07:13:14,446][WARN ][o.e.c.c.ClusterFormationFailureHelper] [data-17] master not discovered yet: have discovered [{data-17}{6rXNZDZgRiiu3TPBD1jncQ}{YTHMD4i4S3eYkjaTzPDRdg}{10.37.187.50}{10.37.187.50:9300}{dilrt}, {master-3}{Ax3huB15R_qNFDvGp-7Jzg}{E4NV2dFoQ-a8iiknwcodbw}{10.37.187.33}{10.37.187.33:9300}{imrt}, {master-1}{_MS9jkxdTv2wCscd0gFmyw}{ogzG8SjVQ0iz65K0VbSPvA}{10.37.187.31}{10.37.187.31:9300}{imrt}, {master-2}{eZwiy4LjSm6-C62fEplTyg}{O4ewI_ytRuu3j1a_1W0YBQ}{10.37.187.32}{10.37.187.32:9300}{imrt}]; discovery will continue using [10.37.187.31:9300, 10.37.187.32:9300, 10.37.187.33:9300, 10.37.187.34:9300, 10.37.187.35:9300, 10.37.187.36:9300, 10.37.187.37:9300, 10.37.187.38:9300, 10.37.187.39:9300, 10.37.187.40:9300, 10.37.187.41:9300, 10.37.187.42:9300, 10.37.187.43:9300, 10.37.187.44:9300, 10.37.187.45:9300, 10.37.187.46:9300, 10.37.187.47:9300, 10.37.187.48:9300, 10.37.187.49:9300, 10.37.187.51:9300, 10.37.187.52:9300, 10.37.187.53:9300, 10.37.187.64:9300, 10.37.187.65:9300] from hosts providers and [{master-3}{Ax3huB15R_qNFDvGp-7Jzg}{E4NV2dFoQ-a8iiknwcodbw}{10.37.187.33}{10.37.187.33:9300}{imrt}, {master-1}{_MS9jkxdTv2wCscd0gFmyw}{ogzG8SjVQ0iz65K0VbSPvA}{10.37.187.31}{10.37.187.31:9300}{imrt}, {master-2}{eZwiy4LjSm6-C62fEplTyg}{O4ewI_ytRuu3j1a_1W0YBQ}{10.37.187.32}{10.37.187.32:9300}{imrt}] from last-known cluster state; node term 35, last-accepted version 762458 in term 35

That would be why.

Again, we need to see more of your logs and not just snippets please.
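As an aside on the very first warning you posted: `indices.recovery.max_concurrent_snapshot_file_downloads` is a dynamic cluster setting, so it can be changed without a restart. This is only a sketch (it assumes a node reachable on `localhost:9200`; adjust host, port, and the value for your cluster), and it won't fix the underlying slow-disk symptoms, but it shows how the setting named in the warning can be adjusted, and how to sanity-check disk I/O on the affected node:

```shell
# Lower the number of concurrent snapshot file downloads per recovery
# (dynamic setting, no restart needed; 5 is an illustrative value):
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.recovery.max_concurrent_snapshot_file_downloads": 5
  }
}'

# The FsHealthService / PersistedClusterStateService warnings point at slow
# disk I/O; the fs section of node stats is a quick place to look:
curl -s "localhost:9200/_nodes/data-17/stats/fs?pretty"
```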

Here are the full logs from when the issue occurred:

What do you think?

400 Link does not exist

I don't know what to think about it...
Please use the built-in </>.

You can download it Here

Nobody will download it. Please use our built-in </>.

I can't post it because it's too large. I have used the built-in </>.

Can anyone help me? Just download the file, it's really safe. I didn't put a virus in it. Trust me, I just need help.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.