It seems that replica shard recovery is stuck in a translog recovery loop. The index stage of the recovery goes through, but the translog stage sits at 100% with no activity for ~30 minutes, which triggers a shard-failed and the loop starts again. The cluster stays YELLOW with the shard stuck in INITIALIZING.
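For reference, this is how the stuck state shows up (standard health and shard-listing calls, using index-a as above):

# cluster health stays yellow the whole time
curl -s 'localhost:9200/_cluster/health?pretty'
# the replica copy of index-a never leaves INITIALIZING
curl -s 'localhost:9200/_cat/shards/index-a?v' | grep -i init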
Options tried so far (the corresponding calls are sketched after the list):
- Flush
- Synced flush
- Merge
- Replica switch from 1 to 0 and back
- Snapshot restore
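These are roughly the calls behind the steps above ("Merge" here means the force-merge API; repository and snapshot names are placeholders):

curl -s -XPOST 'localhost:9200/index-a/_flush'
curl -s -XPOST 'localhost:9200/index-a/_flush/synced'
curl -s -XPOST 'localhost:9200/index-a/_forcemerge'
# drop the replica and add it back
curl -s -XPUT 'localhost:9200/index-a/_settings' -H 'Content-Type: application/json' -d '{"index":{"number_of_replicas":0}}'
curl -s -XPUT 'localhost:9200/index-a/_settings' -H 'Content-Type: application/json' -d '{"index":{"number_of_replicas":1}}'
# restore from snapshot (my_repo / my_snapshot are placeholders)
curl -s -XPOST 'localhost:9200/_snapshot/my_repo/my_snapshot/_restore' -H 'Content-Type: application/json' -d '{"indices":"index-a"}'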
curl -s localhost:9200/_cat/recovery?active_only | grep index-a
index-a 0 23.8m peer translog 10.xxx.zz.yy RFKcotq 10.xxx.xx.xxx 5r8hFLX n/a n/a 289 289 100.0% 289 41607621033 41607621033 100.0% 41607621033 0 0 100.0%
index-a 1 25.1m peer translog 10.xxx.yy.z d4V3wQi 10.xxx.xx.xxx 5r8hFLX n/a n/a 386 386 100.0% 386 41622893547 41622893547 100.0% 41622893547 0 0 100.0%
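The shard-level recovery API gives a bit more detail than _cat/recovery (per-stage and per-file timings), and polling it is an easy way to confirm that nothing is moving once the translog stage is reached:

curl -s 'localhost:9200/index-a/_recovery?detailed=true&human&pretty'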
Master logs
[DEBUG][o.e.c.s.MasterService ] [1tGegQy] processing [shard-failed[shard id [[index-a][0]], allocation id [MXEBhXqzRSaqlAAnvqLbvw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[index-a][0]: Recovery failed from {j9nurFX}{j9nurFXORuu3HQqe2ejusA}{aG66ze-lRjq7EftXXiDBPA}{10.xxx.xx.xx}{10.xxx.xx.xx:9300}{zone=us-east-1b} into {4YTNoLW}{4YTNoLWWQJOxD26OC_8rQw}{lUVUDg7GQF2FmZM_Mri4_A}{10.xxx.xx.xxx}{10.xxx.xx.xxx:9300}{zone=us-east-1a}]; nested: RemoteTransportException[[j9nurFX][10.xxx.xx.xx:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [315] files with total size of [38.7gb]]; nested: ReceiveTimeoutTransportException[[4YTNoLW][10.xxx.xx.xxx:9300][internal:index/shard/recovery/file_chunk] request_id [686636395] timed out after [900000ms]]; ]]]: took [30s] done publishing updated cluster state (version: 963321, uuid: 5XxUS-74QoWDDPks6yyUug)
[2019-08-25T00:31:41,748][WARN ][o.e.c.s.MasterService ] [1tGegQy] cluster state update task [shard-failed[shard id [[index-a][0]], allocation id [MXEBhXqzRSaqlAAnvqLbvw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[index-a][0]: Recovery failed from {j9nurFX}{j9nurFXORuu3HQqe2ejusA}{aG66ze-lRjq7EftXXiDBPA}{10.xxx.xx.xx}{10.xxx.xx.xx:9300}{zone=us-east-1b} into {4YTNoLW}{4YTNoLWWQJOxD26OC_8rQw}{lUVUDg7GQF2FmZM_Mri4_A}{10.xxx.xx.xxx}{10.xxx.xx.xxx:9300}{zone=us-east-1a}]; nested: RemoteTransportException[[j9nurFX][10.xxx.xx.xx:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [315] files with total size of [38.7gb]]; nested: ReceiveTimeoutTransportException[[4YTNoLW][10.xxx.xx.xxx:9300][internal:index/shard/recovery/file_chunk] request_id [686636395] timed out after [900000ms]]; ]]] took [30s] above the warn threshold of 30s
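Reading the trace, the failure is actually reported in phase 1 (file copy) rather than in the translog stage itself: a file_chunk request times out after 900000ms (15 minutes) while transferring 315 files / 38.7gb. That 15-minute value looks like the default indices.recovery.internal_action_timeout, so my working assumption is that the target is either throttled or stalling and never acknowledges a chunk within the window. If that is right, raising the recovery throttle and the internal action timeout (both dynamic cluster settings) would be the next thing to try; the values below are only an illustration, not a recommendation:

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "100mb",
    "indices.recovery.internal_action_timeout": "30m"
  }
}'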