ES 5.1 - Shard recovery stuck in INIT - IndexShardRelocatedException: Already relocated

I had a shard peer recovery stuck on ES 5.1 for 16 hours.

  • The _recovery API showed the recovery stuck in the INIT stage
  • The shards themselves were small (~3 MB)
  • There was plenty of free space on both the source and destination nodes
  • Both the source and destination nodes involved in the peer recovery had the segment/Lucene files on disk, which shows that part of the recovery process did succeed
  • After enabling TRACE logging I am fairly sure this is a bug: I could see the source node repeatedly trying to relocate its shard to the destination and constantly failing with an error of the form

[2017-10-04T17:45:47,586][TRACE][o.e.i.r.PeerRecoveryTargetService] [wb_dtDf] [logs1][3] Got exception on recovery
org.elasticsearch.transport.RemoteTransportException: [qx_Ng9r][x.x.x.x:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.shard.IndexShardRelocatedException: CurrentState[RELOCATED] Already relocated
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget( ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover( ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100( ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived( ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived( ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived( ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived( ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun( ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun( [elasticsearch-5.1.1.jar:5.1.1]
at [elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker( [?:1.8.0_112]
at java.util.concurrent.ThreadPoolExecutor$ [?:1.8.0_112]
at [?:1.8.0_112]
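
A stuck stage like this can also be spotted programmatically from the _recovery output. The sketch below is illustrative, not from the original post: it assumes a response shaped like the 5.x `GET /<index>/_recovery` JSON, and the field names (`stage`, `total_time_in_millis`, `id`) should be treated as assumptions to verify against your cluster's actual output.

```python
# Sketch: flag shards that have sat in the INIT stage for too long,
# given a parsed _recovery response. Field names mimic the ES 5.x
# GET /<index>/_recovery shape and are assumptions here.
STUCK_THRESHOLD_MS = 10 * 60 * 1000  # treat >10 minutes in INIT as stuck

def find_stuck_shards(recovery_response, threshold_ms=STUCK_THRESHOLD_MS):
    """Return (index, shard_id) pairs stuck in INIT longer than threshold."""
    stuck = []
    for index, data in recovery_response.items():
        for shard in data.get("shards", []):
            if (shard.get("stage") == "INIT"
                    and shard.get("total_time_in_millis", 0) > threshold_ms):
                stuck.append((index, shard["id"]))
    return stuck

# Toy payload resembling the situation in this post: shard 3 of logs1
# has been in INIT for ~16 hours while shard 0 recovered normally.
sample = {
    "logs1": {
        "shards": [
            {"id": 3, "stage": "INIT", "total_time_in_millis": 16 * 3600 * 1000},
            {"id": 0, "stage": "DONE", "total_time_in_millis": 4200},
        ]
    }
}

print(find_stuck_shards(sample))  # -> [('logs1', 3)]
```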

  • Based on this, I figured that deleting the partially copied index data on the destination node might let the recovery move forward, and that did fix my problem.

However, I think this is a bug: after getting 'IndexShardRelocatedException: CurrentState[RELOCATED] Already relocated' roughly every minute for 16 hours' worth of attempts, the recovery code should have internally deleted the copied shard contents on the destination node instead of retrying forever.
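
What the post is asking for could look something like a bounded retry loop that cleans up the destination's partial files once the source keeps reporting the shard as already relocated. This is a hypothetical sketch of that behaviour; the names and structure are invented for illustration and do not reflect the actual Elasticsearch recovery code:

```python
class AlreadyRelocatedError(Exception):
    """Stands in for IndexShardRelocatedException from the source node."""

def recover_with_cleanup(start_recovery, cleanup_partial_files, max_attempts=5):
    """Retry a recovery; on an 'already relocated' error, delete the
    partial files on the destination so the next attempt starts fresh,
    instead of retrying the same doomed request indefinitely."""
    for attempt in range(1, max_attempts + 1):
        try:
            return start_recovery()
        except AlreadyRelocatedError:
            # The source handed off its primary, so the segments already
            # copied to the destination belong to a dead recovery attempt.
            cleanup_partial_files()
    raise RuntimeError("recovery failed after %d attempts" % max_attempts)

# Toy demonstration: the first two attempts hit the stale source,
# the third succeeds after cleanup.
state = {"calls": 0, "cleanups": 0}

def fake_start_recovery():
    state["calls"] += 1
    if state["calls"] < 3:
        raise AlreadyRelocatedError("CurrentState[RELOCATED] Already relocated")
    return "DONE"

def fake_cleanup():
    state["cleanups"] += 1

print(recover_with_cleanup(fake_start_recovery, fake_cleanup))  # -> DONE
```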

Please raise this on GitHub so we can take a closer look.
