Empty .kibana index cannot be relocated after node exclusion has been set

Dear support,

We have run into a shard relocation issue after setting a node exclusion. In our case, the original cluster is 6.8.2; to upgrade it, we add the same number of new 7.5.1 nodes and then exclude the 6.8.2 nodes.

However, after adding the 7.5.1 nodes and excluding the 6.8.2 nodes in the cluster settings, one shard of the single empty .kibana index cannot be relocated successfully. We have hit this issue several times.

Here is the node list after adding the new nodes; there are four 6.8.2 nodes and four 7.5.1 nodes:
[c_log@VM_1_14_centos ~/repository]$ curl "localhost:9200/_cat/nodes?h=version,name,node.role&s=version"
6.8.2 1590650188002472432 dmi
6.8.2 1590650188002472632 dmi
6.8.2 1590650188002472732 dmi
6.8.2 1590650188002472532 dmi
7.5.1 1590650759002483032 dmi
7.5.1 1590650759002483132 dmi
7.5.1 1590650759002482832 dmi
7.5.1 1590650759002482932 dmi

Then we applied this cluster setting to exclude the 6.8.2 nodes:

"transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "node_concurrent_recoveries" : "10",
          "exclude" : {
            "_name" : "1590650188002472632,1590650188002472732,1590650188002472432,1590650188002472532"
          }
        }
      }
    }
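
For reference, the setting was applied with a request along these lines (the exact request body is reconstructed from the settings above, so treat it as an approximation):

curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": "10",
    "cluster.routing.allocation.exclude._name": "1590650188002472632,1590650188002472732,1590650188002472432,1590650188002472532"
  }
}'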

The cluster is otherwise empty and only contains the Kibana index. There is a single internal .kibana_1 system index, and it contains no documents:
[c_log@VM_1_14_centos ~]$ curl localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .kibana_1 5nRyca57QeaIN4O_SerQ7g 1 1 0 0 522b 261b

In the end, the shard 0 replica cannot be relocated to a new node:
[c_log@VM_1_14_centos ~]$ curl localhost:9200/_cat/shards?v
index shard prirep state docs store ip node
.kibana_1 0 p STARTED 0 261b 10.0.0.82 1590650759002483132 (relocated successfully)
.kibana_1 0 r STARTED 0 261b 10.0.0.148 1590650188002472732 (failed shard; it should have been relocated)
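
If it helps with the diagnosis, we can also run the allocation explain API for this replica and share the output; a request along these lines (shard details taken from the listing above):

curl -XGET "localhost:9200/_cluster/allocation/explain" -H 'Content-Type: application/json' -d'
{
  "index": ".kibana_1",
  "shard": 0,
  "primary": false
}'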

On the master and target node we see the following exception; there is no exception on the source node:

 [2020-05-28T15:26:59,295][WARN ][o.e.i.c.IndicesClusterStateService] [1590650759002483032] [.kibana_1][0] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [.kibana_1][0]: Recovery failed from {1590650759002483132}{o5bJB_gPT6WiEDdt0l-v0Q}{AaaWa5nTQOuOwacnaE5xpA}{10.0.0.82}{10.0.0.82:20839}{di}{temperature=hot, rack=cvm_1_100003, set=100003, region=1, ip=9.10.49.143} into {1590650759002483032}{l42RGM6tSz-3-Dquma5OzQ}{ZZSxRSWXQjOFlETue5UHxQ}{10.0.0.205}{10.0.0.205:29559}{di}{rack=cvm_1_100003, set=100003, ip=9.10.48.33, temperature=hot, region=1}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:247) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:292) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1120) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:259) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.1.jar:7.5.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_181]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_181]
Caused by: org.elasticsearch.transport.RemoteTransportException: [1590650759002483132][10.0.0.82:20839][internal:index/shard/recovery/start_recovery]
Caused by: java.lang.IllegalStateException: can't move recovery to stage [FINALIZE]. current stage: [INDEX] (expected [TRANSLOG])
        at org.elasticsearch.indices.recovery.RecoveryState.validateAndSetStage(RecoveryState.java:175) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.RecoveryState.setStage(RecoveryState.java:206) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.index.shard.IndexShard.finalizeRecovery(IndexShard.java:1718) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$finalizeRecovery$1(RecoveryTarget.java:313) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:285) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.RecoveryTarget.finalizeRecovery(RecoveryTarget.java:294) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FinalizeRecoveryRequestHandler.messageReceived(PeerRecoveryTargetService.java:395) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FinalizeRecoveryRequestHandler.messageReceived(PeerRecoveryTargetService.java:389) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:280) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.1.jar:7.5.1]
        ... 3 more

The cluster stays in green status after the relocation fails; the shard simply cannot be moved off the excluded node.

Any idea why the translog stage was skipped and caused the recovery to fail? The issue cannot be easily reproduced. Thanks a lot.
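
As a possible workaround we could force the move with an explicit reroute command (node names taken from the listings above), although we have not verified whether it avoids the same recovery failure:

curl -XPOST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "move": {
        "index": ".kibana_1",
        "shard": 0,
        "from_node": "1590650188002472732",
        "to_node": "1590650759002483032"
      }
    }
  ]
}'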

Could anyone please help check this issue?

Thanks.
