Index relocation failing at 100% bytes

After upgrading from 7.17.23 to 7.17.26, I started seeing a pattern like this:

  • ILM starts relocating the index from the warm tier to the cold tier
  • in /_cat/recovery the transfer reaches 100% in the bp (bytes_percent) column (see the sketch after this list)
  • nothing in the log of the source node
  • error in the log of the target node
  • repeat
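
For reference, this is roughly how the loop can be watched. A minimal sketch, assuming the cluster answers on localhost:9200; the column names are standard _cat/recovery columns:

  # watch active recoveries; bp (bytes_percent) sits at 100 while the loop repeats
  curl -s 'localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,type,stage,source_node,target_node,bytes_percent'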

[2025-01-08T13:07:16,075][WARN ][o.e.i.c.IndicesClusterStateService] [Cold2] [set1_228][1] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [set1_228][1]: Recovery failed from {Warm1}{id-removed}{id-removed}{192.168.0.1}{192.168.0.1:9300}{hiw}{xpack.installed=true, transform.node=false} into {Cold2}{id-removed}{id-removed}{192.168.0.12}{192.168.0.12:9300}{cmv}{xpack.installed=true, transform.node=false} (failed to retry recovery)
        at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:137) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:199) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:195) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:767) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:380) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.26.jar:7.17.26]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.lang.Thread.run(Thread.java:1570) [?:?]
Caused by: java.lang.IllegalStateException: cannot reset recovery as previous attempt made it past finalization step
        at org.elasticsearch.indices.recovery.RecoveryTarget.resetRecovery(RecoveryTarget.java:241) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:114) ~[elasticsearch-7.17.26.jar:7.17.26]
        ... 10 more
[2025-01-08T13:13:57,339][WARN ][o.e.i.c.IndicesClusterStateService] [Cold2] [set2_228][1] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [set2_228][1]: Recovery failed from {Warm1}{id-removed}{id-removed}{192.168.0.1}{192.168.0.1:9300}{hiw}{xpack.installed=true, transform.node=false} into {Cold2}{id-removed}{id-removed}{192.168.0.12}{192.168.0.12:9300}{cmv}{xpack.installed=true, transform.node=false} (failed to retry recovery)
        at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:137) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:199) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:195) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:767) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:380) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.26.jar:7.17.26]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.lang.Thread.run(Thread.java:1570) [?:?]
Caused by: java.lang.IllegalStateException: cannot reset recovery as previous attempt made it past finalization step
        at org.elasticsearch.indices.recovery.RecoveryTarget.resetRecovery(RecoveryTarget.java:241) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:114) ~[elasticsearch-7.17.26.jar:7.17.26]
        ... 10 more

An internet search for "cannot reset recovery as previous attempt made it past finalization step" turns up no relevant results.
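
For anyone hitting the same message: the allocation explain API usually gives more context than the node logs, and the failed-allocation counter can be cleared once a cause is addressed. A sketch, assuming localhost:9200 and the index/shard names from the log above:

  # explain the current allocation / last failure of the affected shard
  curl -s -H 'Content-Type: application/json' 'localhost:9200/_cluster/allocation/explain?pretty' \
    -d '{"index": "set1_228", "shard": 1, "primary": true}'

  # after addressing the cause, let the cluster retry failed allocations
  curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'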

Updates

  • Relocating another index from Cold back to Warm works fine (moves like these can be issued by hand with _cluster/reroute; see the sketch after this list)
  • Relocating another index from a different Warm node to Cold works fine
  • Relocating the affected index from its Warm node to a different Warm node fails
  • Relocating the affected shard to Cold still fails today
  • Relocating a different shard of the same index to the affected Warm node works
  • Relocating this different shard to the other Warm node works
  • Relocating the originally affected shard to the same Warm node fails
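
A sketch of such a manual move, assuming localhost:9200 and the index/node names from the log; node names can be given as shown in /_cat/nodes:

  # manually move the affected shard from the warm node to the cold node
  curl -s -H 'Content-Type: application/json' -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '
  {
    "commands": [
      { "move": { "index": "set1_228", "shard": 1, "from_node": "Warm1", "to_node": "Cold2" } }
    ]
  }'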

It looks like about 4 shards are currently stuck like this. That is more than the outgoing recovery limit, so there are always 2-3 shards transferring and one waiting. It looks like, because of this, each transfer starts from zero?
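
The limit in question here should be cluster.routing.allocation.node_concurrent_outgoing_recoveries (default 2). A sketch of inspecting it and, if needed, raising it while troubleshooting (assuming localhost:9200; revert afterwards):

  # show the effective per-node recovery concurrency limits
  curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' \
    | grep -E 'node_concurrent_(outgoing|incoming)_recoveries'

  # optionally raise the outgoing limit while troubleshooting
  curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '
  { "persistent": { "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 4 } }'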

I overrode all but one back to the warm tier, and now it looks like the cold node can keep the transferred data between attempts. This speeds up the loop to roughly 10 attempts per second. The error is the same.
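
One way such a per-index override back to the warm tier can be expressed is via the tier preference setting; this is an assumption about the mechanism used, and ILM may reapply its own value on the next phase transition. A sketch, assuming localhost:9200:

  # pin an index back to the warm tier
  curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/set1_228/_settings' -d '
  { "index.routing.allocation.include._tier_preference": "data_warm,data_hot" }'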

So it looks like particular shards are affected by this.
Other shards of the same index, and other indices, move fine.

Decided to reboot the node hosting the affected shards.

No errors in the logs, but the replicas did not want to synchronize and stayed stuck in the INITIALIZING state for longer than normal. I was worried, but after about 10 minutes they finally synchronized.
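
For completeness, the usual restart hygiene and a way to watch the recovering replicas; a sketch, assuming localhost:9200:

  # before stopping the node: allow only primary allocation, and flush
  curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '
  { "persistent": { "cluster.routing.allocation.enable": "primaries" } }'
  curl -s -XPOST 'localhost:9200/_flush'

  # after the node rejoins: re-enable allocation and watch the INITIALIZING replicas
  curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '
  { "persistent": { "cluster.routing.allocation.enable": null } }'
  curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node' | grep INITIALIZING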

  • Trying to move the originally affected shard again now

Good news: after that restart, the indices move with no problems.

Thanks so much for sharing these troubleshooting details, @nisow95612.