Translog recovery stuck [ES 6.0]

It seems that shard recovery for a replica is stuck in a translog recovery loop. The index stage of the recovery completes, but the translog stage gets stuck at 100% with no activity for 30m, which causes a shard-failed event, and then the loop repeats. The cluster stays YELLOW with the shard stuck in INITIALIZING.

Options tried so far

  1. Flush
  2. Synced flush
  3. Merge
  4. Replica switch from 1 to 0 and back
  5. Snapshot restore

curl -s localhost:9200/_cat/recovery?active_only | grep index-a
index-a 0 23.8m peer translog  RFKcotq 5r8hFLX n/a n/a 289 289 100.0% 289 41607621033 41607621033 100.0% 41607621033 0      0 100.0%
index-a 1 25.1m peer translog   d4V3wQi 5r8hFLX n/a n/a 386 386 100.0% 386 41622893547 41622893547 100.0% 41622893547 0      0 100.0%
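
As a rough aid for watching this, the recoveries sitting in the translog stage can be filtered out of the _cat/recovery output. This is only a sketch: the function name translog_stage_shards is made up for illustration, and it assumes the column layout of the output above, where the fifth column is the recovery stage. A different column selection would need a different field number.

```shell
# Hypothetical helper: reads _cat/recovery output on stdin and prints
# index, shard and elapsed time for every recovery in the translog stage.
# Assumption: stage is the fifth whitespace-separated column, as in the
# output pasted above.
translog_stage_shards() {
  awk '$5 == "translog" { print $1, $2, $3 }'
}
```

It could be fed from the same command as above, for example: curl -s 'localhost:9200/_cat/recovery?active_only' | translog_stage_shards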

Master logs

[DEBUG][o.e.c.s.MasterService    ] [1tGegQy] processing [shard-failed[shard id [[index-a][0]], allocation id [MXEBhXqzRSaqlAAnvqLbvw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[index-a][0]: Recovery failed from {j9nurFX}{j9nurFXORuu3HQqe2ejusA}{aG66ze-lRjq7EftXXiDBPA}{}{}{zone=us-east-1b} into {4YTNoLW}{4YTNoLWWQJOxD26OC_8rQw}{lUVUDg7GQF2FmZM_Mri4_A}{}{}{zone=us-east-1a}]; nested: RemoteTransportException[[j9nurFX][][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [315] files with total size of [38.7gb]]; nested: ReceiveTimeoutTransportException[[4YTNoLW][][internal:index/shard/recovery/file_chunk] request_id [686636395] timed out after [900000ms]]; ]]]: took [30s] done publishing updated cluster state (version: 963321, uuid: 5XxUS-74QoWDDPks6yyUug)
[2019-08-25T00:31:41,748][WARN ][o.e.c.s.MasterService    ] [1tGegQy] cluster state update task [shard-failed[shard id [[index-a][0]], allocation id [MXEBhXqzRSaqlAAnvqLbvw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[index-a][0]: Recovery failed from {j9nurFX}{j9nurFXORuu3HQqe2ejusA}{aG66ze-lRjq7EftXXiDBPA}{}{}{zone=us-east-1b} into {4YTNoLW}{4YTNoLWWQJOxD26OC_8rQw}{lUVUDg7GQF2FmZM_Mri4_A}{}{}{zone=us-east-1a}]; nested: RemoteTransportException[[j9nurFX][][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [315] files with total size of [38.7gb]]; nested: ReceiveTimeoutTransportException[[4YTNoLW][][internal:index/shard/recovery/file_chunk] request_id [686636395] timed out after [900000ms]]; ]]] took [30s] above the warn threshold of 30s

@DavidTurner any pointers here would be helpful. Is this a known issue and has it been solved already?

The very few log messages you've provided don't tie in with your description of the problem:

ReceiveTimeoutTransportException[[4YTNoLW][][internal:index/shard/recovery/file_chunk] request_id [686636395] timed out after [900000ms]

This tells us that the recovery failed during the index phase. I'm not aware of a known issue, but 6.0 is EOL. Can you reproduce this on a supported version? Do you see any more useful logs on the recovery source or target nodes?
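
For reading such failures more quickly, the timed-out transport action and the timeout can be pulled out of the log line. The following is a minimal sketch, not an official tool: the function name timed_out_action is hypothetical, and the pattern assumes the exact ReceiveTimeoutTransportException message format quoted above.

```shell
# Hypothetical helper: reads log lines on stdin and, for each
# ReceiveTimeoutTransportException, prints the transport action that
# timed out and the timeout converted from milliseconds to minutes.
# Assumption: the message has the form
#   ReceiveTimeoutTransportException[[node][address][action] request_id [N] timed out after [Nms]]
timed_out_action() {
  sed -n 's/.*ReceiveTimeoutTransportException\[\[[^]]*\]\[[^]]*\]\[\([^]]*\)\] request_id \[[0-9]*\] timed out after \[\([0-9]*\)ms\].*/\1 \2/p' |
    awk '{ printf "%s %d min\n", $1, $2 / 60000 }'
}
```

Applied to the message quoted above, this reports the file_chunk action, which belongs to the file copy in phase 1, together with a 15 minute timeout (900000 ms).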

Thanks @DavidTurner
I have not been able to reproduce it so far. I have updated the logs from the data nodes, hot threads and recovery stats here.
Let me know if you need additional info. Meanwhile, I'm trying to pull up additional relevant logs.

The log says index phase, but I think it's probably the same message when translog recovery fails too. I haven't verified this, though.
