Empty .kibana index cannot be relocated after node exclusion has been set

Dear support,

We have run into a shard relocation issue after setting a node exclusion. In our case, the original cluster is 6.8.2; to upgrade it, we add the same number of new 7.5.1 nodes and then exclude the 6.8.2 nodes.

However, after adding the 7.5.1 nodes and excluding the 6.8.2 nodes in the cluster settings, one shard of the single empty .kibana index cannot be relocated successfully. We have hit this issue several times.

Here is the node list after adding the new nodes; there are four 6.8.2 nodes and four 7.5.1 nodes:
[c_log@VM_1_14_centos ~/repository]$ curl "localhost:9200/_cat/nodes?h=version,name,node.role&s=version"
6.8.2 1590650188002472432 dmi
6.8.2 1590650188002472632 dmi
6.8.2 1590650188002472732 dmi
6.8.2 1590650188002472532 dmi
7.5.1 1590650759002483032 dmi
7.5.1 1590650759002483132 dmi
7.5.1 1590650759002482832 dmi
7.5.1 1590650759002482932 dmi

Then we applied this cluster setting to exclude the 6.8.2 nodes:

"transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "node_concurrent_recoveries" : "10",
          "exclude" : {
            "_name" : "1590650188002472632,1590650188002472732,1590650188002472432,1590650188002472532"
          }
        }
      }
    }
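
For reference, the setting was applied with a request along these lines (the exact request body is reconstructed from the settings above, so treat it as an approximation):

curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": "10",
    "cluster.routing.allocation.exclude._name": "1590650188002472632,1590650188002472732,1590650188002472432,1590650188002472532"
  }
}'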

The cluster is otherwise empty and only contains the Kibana index. There is a single internal .kibana_1 system index, and it contains no documents:
[c_log@VM_1_14_centos ~]$ curl localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .kibana_1 5nRyca57QeaIN4O_SerQ7g 1 1 0 0 522b 261b

In the end, the shard 0 replica cannot be relocated to a new node:
[c_log@VM_1_14_centos ~]$ curl localhost:9200/_cat/shards?v
index shard prirep state docs store ip node
.kibana_1 0 p STARTED 0 261b 10.0.0.82 1590650759002483132 (relocated successfully)
.kibana_1 0 r STARTED 0 261b 10.0.0.148 1590650188002472732 (failed shard; it should have been relocated)
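
If it helps with the diagnosis, we can also run the allocation explain API for this replica and share the output; a request along these lines (shard details taken from the listing above):

curl -XGET "localhost:9200/_cluster/allocation/explain" -H 'Content-Type: application/json' -d'
{
  "index": ".kibana_1",
  "shard": 0,
  "primary": false
}'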

On the master and target node we see the following exception; there is no exception on the source node:

 [2020-05-28T15:26:59,295][WARN ][o.e.i.c.IndicesClusterStateService] [1590650759002483032] [.kibana_1][0] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [.kibana_1][0]: Recovery failed from {1590650759002483132}{o5bJB_gPT6WiEDdt0l-v0Q}{AaaWa5nTQOuOwacnaE5xpA}{10.0.0.82}{10.0.0.82:20839}{di}{temperature=hot, rack=cvm_1_100003, set=100003, region=1, ip=9.10.49.143} into {1590650759002483032}{l42RGM6tSz-3-Dquma5OzQ}{ZZSxRSWXQjOFlETue5UHxQ}{10.0.0.205}{10.0.0.205:29559}{di}{rack=cvm_1_100003, set=100003, ip=9.10.48.33, temperature=hot, region=1}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:247) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:292) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1120) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:259) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.1.jar:7.5.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_181]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_181]
Caused by: org.elasticsearch.transport.RemoteTransportException: [1590650759002483132][10.0.0.82:20839][internal:index/shard/recovery/start_recovery]
Caused by: java.lang.IllegalStateException: can't move recovery to stage [FINALIZE]. current stage: [INDEX] (expected [TRANSLOG])
        at org.elasticsearch.indices.recovery.RecoveryState.validateAndSetStage(RecoveryState.java:175) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.RecoveryState.setStage(RecoveryState.java:206) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.index.shard.IndexShard.finalizeRecovery(IndexShard.java:1718) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$finalizeRecovery$1(RecoveryTarget.java:313) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:285) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.RecoveryTarget.finalizeRecovery(RecoveryTarget.java:294) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FinalizeRecoveryRequestHandler.messageReceived(PeerRecoveryTargetService.java:395) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FinalizeRecoveryRequestHandler.messageReceived(PeerRecoveryTargetService.java:389) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:280) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.1.jar:7.5.1]
        ... 3 more

The cluster stays in green status after the relocation fails; the shard simply cannot be moved off the excluded node.

Any idea why the translog stage was skipped and caused the recovery to fail? The issue cannot be easily reproduced. Thanks a lot.
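
As a possible workaround we could force the move with an explicit reroute command (node names taken from the listings above), although we have not verified whether it avoids the same recovery failure:

curl -XPOST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "move": {
        "index": ".kibana_1",
        "shard": 0,
        "from_node": "1590650188002472732",
        "to_node": "1590650759002483032"
      }
    }
  ]
}'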

Could anyone please help check this issue?

Thanks.
