I had a shard peer recovery stuck on ES 5.1.1 for 16 hours.
- The _recovery API showed the recovery stuck in the INIT stage (a sketch of how I was polling it is at the end of this post)
- The shards themselves were small, ~3 MB
- There was plenty of free space on both the source and destination nodes
- Both the source and destination nodes involved in the peer recovery had the segment/Lucene files on disk, which shows that part of the recovery process did succeed
- After enabling TRACE logging (also sketched at the end of this post) I am fairly sure this is a bug: the logs showed that the source node trying to relocate its shard to the destination kept retrying constantly, each attempt failing with an error of the form
[2017-10-04T17:45:47,586][TRACE][o.e.i.r.PeerRecoveryTargetService] [wb_dtDf] [logs1][3] Got exception on recovery
org.elasticsearch.transport.RemoteTransportException: [qx_Ng9r][x.x.x.x:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.shard.IndexShardRelocatedException: CurrentState[RELOCATED] Already relocated
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:162) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:119) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:128) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:125) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1385) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
- Based on this, I figured that deleting the partially copied shard contents on the destination might let the recovery move forward, and doing that did fix my problem.
However, I still think this is a bug: after getting 'IndexShardRelocatedException: CurrentState[RELOCATED] Already relocated' roughly every minute for 16 hours' worth of attempts, the recovery code should have cleaned up the copied shard contents on the destination node itself instead of retrying forever.
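
For reference, this is roughly how I was watching the recovery state while it was stuck. It's a sketch rather than my exact commands; it assumes the cluster's HTTP endpoint is reachable on localhost:9200 and uses the index name logs1 from the logs above:

```python
import requests

ES = "http://localhost:9200"  # assumed HTTP endpoint; adjust for your cluster

# Per-shard recovery details for the index; active_only filters out finished recoveries.
resp = requests.get(f"{ES}/logs1/_recovery", params={"active_only": "true"})
resp.raise_for_status()

for index, info in resp.json().items():
    for shard in info.get("shards", []):
        # In my case the stage never moved past INIT for this shard.
        print(index, shard["id"], shard["stage"],
              shard["source"].get("name"), "->", shard["target"].get("name"))
```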
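The TRACE logs quoted above were enabled with a transient cluster setting along these lines (again only a sketch; the logger name is simply the recovery package visible in the stack trace, and it's worth resetting it to null once you're done):

```python
import requests

ES = "http://localhost:9200"  # assumed HTTP endpoint; adjust for your cluster

# Raise the recovery package logger to TRACE; transient, so it does not survive a full cluster restart.
settings = {"transient": {"logger.org.elasticsearch.indices.recovery": "TRACE"}}
resp = requests.put(f"{ES}/_cluster/settings", json=settings)
print(resp.json())

# Reset afterwards by sending the same key with a value of None (serialized as null).
```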