After upgrading from 7.17.23 to 7.17.26, I started seeing a pattern like this:
- ILM starts relocating from warm tier to cold
- in /_cat/allocation transfer goes to 100% bp
- nothing in log of source node
- error in log of target node
- repeat
[2025-01-08T13:07:16,075][WARN ][o.e.i.c.IndicesClusterStateService] [Cold2] [set1_228][1] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [set1_228][1]: Recovery failed from {Warm1}{id-removed}{id-removed}{192.168.0.1}{192.168.0.1:9300}{hiw}{xpack.installed=true, transform.node=false} into {Cold2}{id-removed}{id-removed}{192.168.0.12}{192.168.0.12:9300}{cmv}{xpack.installed=true, transform.node=false} (failed to retry recovery)
at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:137) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:199) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:195) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:767) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:380) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.26.jar:7.17.26]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.lang.Thread.run(Thread.java:1570) [?:?]
Caused by: java.lang.IllegalStateException: cannot reset recovery as previous attempt made it past finalization step
at org.elasticsearch.indices.recovery.RecoveryTarget.resetRecovery(RecoveryTarget.java:241) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:114) ~[elasticsearch-7.17.26.jar:7.17.26]
... 10 more
[2025-01-08T13:13:57,339][WARN ][o.e.i.c.IndicesClusterStateService] [Cold2] [set2_228][1] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [set2_228][1]: Recovery failed from {Warm1}{id-removed}{id-removed}{192.168.0.1}{192.168.0.1:9300}{hiw}{xpack.installed=true, transform.node=false} into {Cold2}{id-removed}{id-removed}{192.168.0.12}{192.168.0.12:9300}{cmv}{xpack.installed=true, transform.node=false} (failed to retry recovery)
at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:137) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:199) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:195) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:767) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:380) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.26.jar:7.17.26]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.lang.Thread.run(Thread.java:1570) [?:?]
Caused by: java.lang.IllegalStateException: cannot reset recovery as previous attempt made it past finalization step
at org.elasticsearch.indices.recovery.RecoveryTarget.resetRecovery(RecoveryTarget.java:241) [elasticsearch-7.17.26.jar:7.17.26]
at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:114) ~[elasticsearch-7.17.26.jar:7.17.26]
... 10 more
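The loop described at the top (bytes reaching 100% while the recovery never completes) could be spotted with a small script. This is only a minimal sketch: it assumes the rows come from `GET /_cat/recovery?format=json&active_only=true&h=index,shard,stage,source_node,target_node,bytes_percent`, where `bp` is the short alias of `bytes_percent`, and that the values arrive as strings. The function name `stuck_recoveries` and the sample rows are hypothetical and are not real output from the affected cluster.

```python
import json


def stuck_recoveries(rows):
    """Return recoveries whose bytes are fully copied but whose stage is not 'done'."""
    stuck = []
    for row in rows:
        # The cat API reports percentages as strings such as "100.0%".
        percent = float(row["bytes_percent"].rstrip("%"))
        if percent >= 100.0 and row["stage"] != "done":
            stuck.append(
                (row["index"], row["shard"], row["source_node"], row["target_node"])
            )
    return stuck


# Hypothetical sample, shaped like the JSON output of the cat recovery API.
sample = json.loads("""
[
  {"index": "set1_228", "shard": "1", "stage": "finalize",
   "source_node": "Warm1", "target_node": "Cold2", "bytes_percent": "100.0%"},
  {"index": "set1_228", "shard": "0", "stage": "done",
   "source_node": "Warm1", "target_node": "Cold2", "bytes_percent": "100.0%"},
  {"index": "set2_228", "shard": "1", "stage": "index",
   "source_node": "Warm1", "target_node": "Cold2", "bytes_percent": "42.3%"}
]
""")

print(stuck_recoveries(sample))
```

Running this periodically and comparing the results with the timestamps of the `failed recovery` warnings on the target node may help confirm that each failure follows a fully transferred shard.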
An internet search for "cannot reset recovery as previous attempt made it past finalization step" returns no results that match my situation.
Updates
- Relocating another index from Cold back to Warm works fine
- Relocating another index from different Warm node to Cold works fine
- Relocating affected index from Warm node to different Warm node fails
- Relocating affected shard to Cold still fails today
- Relocating different shard of same index to affected Warm node works
- Relocating this different shard to other Warm node works
- Relocating original affected shard to same Warm node fails
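For anyone who wants to reproduce these tests, a manual relocation can be requested through the cluster reroute API. The post does not say how the relocations above were triggered, so this is an assumption: the sketch below only builds the body for `POST /_cluster/reroute` with a single `move` command. The helper name `move_command` is hypothetical; the index and node names are the ones from the log.

```python
import json


def move_command(index, shard, from_node, to_node):
    """Build a request body for POST /_cluster/reroute with one 'move' command."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


# Shard 1 of set1_228, from Warm1 to Cold2, as in the first log entry.
body = move_command("set1_228", 1, "Warm1", "Cold2")
print(json.dumps(body, indent=2))
```

Note that ILM or tier allocation rules may move the shard again after a manual reroute, so the result should be checked soon after the command is sent.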