Index relocation failing at 100% bytes

After upgrading from 7.17.23 to 7.17.26, I started seeing a pattern like this:

  • ILM starts relocating the index from the warm tier to the cold tier
  • in /_cat/recovery the transfer reaches 100% in the bp (bytes_percent) column (see the sketch after this list)
  • nothing in the log of the source node
  • error in the log of the target node
  • repeat
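
For reference, this is roughly how the loop can be watched. A minimal sketch, assuming the cluster answers on localhost:9200; the column names are standard _cat/recovery columns:

  # watch active recoveries; bp (bytes_percent) sits at 100 while the loop repeats
  curl -s 'localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,type,stage,source_node,target_node,bytes_percent'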

[2025-01-08T13:07:16,075][WARN ][o.e.i.c.IndicesClusterStateService] [Cold2] [set1_228][1] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [set1_228][1]: Recovery failed from {Warm1}{id-removed}{id-removed}{192.168.0.1}{192.168.0.1:9300}{hiw}{xpack.installed=true, transform.node=false} into {Cold2}{id-removed}{id-removed}{192.168.0.12}{192.168.0.12:9300}{cmv}{xpack.installed=true, transform.node=false} (failed to retry recovery)
        at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:137) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:199) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:195) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:767) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:380) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.26.jar:7.17.26]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.lang.Thread.run(Thread.java:1570) [?:?]
Caused by: java.lang.IllegalStateException: cannot reset recovery as previous attempt made it past finalization step
        at org.elasticsearch.indices.recovery.RecoveryTarget.resetRecovery(RecoveryTarget.java:241) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:114) ~[elasticsearch-7.17.26.jar:7.17.26]
        ... 10 more
[2025-01-08T13:13:57,339][WARN ][o.e.i.c.IndicesClusterStateService] [Cold2] [set2_228][1] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [set2_228][1]: Recovery failed from {Warm1}{id-removed}{id-removed}{192.168.0.1}{192.168.0.1:9300}{hiw}{xpack.installed=true, transform.node=false} into {Cold2}{id-removed}{id-removed}{192.168.0.12}{192.168.0.12:9300}{cmv}{xpack.installed=true, transform.node=false} (failed to retry recovery)
        at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:137) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:199) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.retryRecovery(PeerRecoveryTargetService.java:195) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:767) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:380) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.26.jar:7.17.26]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.lang.Thread.run(Thread.java:1570) [?:?]
Caused by: java.lang.IllegalStateException: cannot reset recovery as previous attempt made it past finalization step
        at org.elasticsearch.indices.recovery.RecoveryTarget.resetRecovery(RecoveryTarget.java:241) [elasticsearch-7.17.26.jar:7.17.26]
        at org.elasticsearch.indices.recovery.RecoveriesCollection.resetRecovery(RecoveriesCollection.java:114) ~[elasticsearch-7.17.26.jar:7.17.26]
        ... 10 more

An internet search for "cannot reset recovery as previous attempt made it past finalization step" turns up no relevant results.
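
For anyone hitting the same message: the allocation explain API usually gives more context than the node logs, and the failed-allocation counter can be cleared once a cause is addressed. A sketch, assuming localhost:9200 and the index/shard names from the log above:

  # explain the current allocation / last failure of the affected shard
  curl -s -H 'Content-Type: application/json' 'localhost:9200/_cluster/allocation/explain?pretty' \
    -d '{"index": "set1_228", "shard": 1, "primary": true}'

  # after addressing the cause, let the cluster retry failed allocations
  curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'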

Updates

  • Relocating another index from Cold back to Warm works fine (moves like these can be issued by hand with _cluster/reroute; see the sketch after this list)
  • Relocating another index from a different Warm node to Cold works fine
  • Relocating the affected index from its Warm node to a different Warm node fails
  • Relocating the affected shard to Cold still fails today
  • Relocating a different shard of the same index to the affected Warm node works
  • Relocating this different shard to the other Warm node works
  • Relocating the originally affected shard to the same Warm node fails
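
A sketch of such a manual move, assuming localhost:9200 and the index/node names from the log; node names can be given as shown in /_cat/nodes:

  # manually move the affected shard from the warm node to the cold node
  curl -s -H 'Content-Type: application/json' -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '
  {
    "commands": [
      { "move": { "index": "set1_228", "shard": 1, "from_node": "Warm1", "to_node": "Cold2" } }
    ]
  }'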

It looks like about 4 shards are currently stuck like this. That is more than the outgoing recovery limit, so there are always 2-3 shards transferring and one waiting. It looks like, because of this, each transfer starts from zero?
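
The limit in question here should be cluster.routing.allocation.node_concurrent_outgoing_recoveries (default 2). A sketch of inspecting it and, if needed, raising it while troubleshooting (assuming localhost:9200; revert afterwards):

  # show the effective per-node recovery concurrency limits
  curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' \
    | grep -E 'node_concurrent_(outgoing|incoming)_recoveries'

  # optionally raise the outgoing limit while troubleshooting
  curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '
  { "persistent": { "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 4 } }'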

I overrode all but one back to the warm tier, and now it looks like the cold node can keep the transferred data between attempts. This speeds up the loop to roughly 10 attempts per second. The error is the same.
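
One way such a per-index override back to the warm tier can be expressed is via the tier preference setting; this is an assumption about the mechanism used, and ILM may reapply its own value on the next phase transition. A sketch, assuming localhost:9200:

  # pin an index back to the warm tier
  curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/set1_228/_settings' -d '
  { "index.routing.allocation.include._tier_preference": "data_warm,data_hot" }'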

So it looks like particular shards are affected by this.
Other shards of the same index, and other indices, move fine.

Decided to reboot the node hosting the affected shards.

No errors in the logs, but the replicas did not want to synchronize and stayed stuck in the INITIALIZING state for longer than normal. I was worried, but after about 10 minutes they finally synchronized.
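
For completeness, the usual restart hygiene and a way to watch the recovering replicas; a sketch, assuming localhost:9200:

  # before stopping the node: allow only primary allocation, and flush
  curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '
  { "persistent": { "cluster.routing.allocation.enable": "primaries" } }'
  curl -s -XPOST 'localhost:9200/_flush'

  # after the node rejoins: re-enable allocation and watch the INITIALIZING replicas
  curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '
  { "persistent": { "cluster.routing.allocation.enable": null } }'
  curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node' | grep INITIALIZING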

  • Trying to move the originally affected shard again now

Good news: after that restart, the indices move with no problems.

Thanks so much for sharing these troubleshooting details, @nisow95612.