Shard reallocation stops

Very confused here. I just upgraded the RAM in one node of a 2-node cluster. The node was shut down cleanly and brought back up cleanly after the upgrade. When the node starts, shards begin reallocating from the other node, but it gets down to 23 shards remaining and stops. There's nothing I can see that is special about the shards that don't reallocate - they vary in size from about 1MB to 100GB. I have tried disabling and re-enabling shard allocation via cluster.routing.allocation.enable. There is nothing in the Elasticsearch logs on either of the two nodes.
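For reference, this is roughly how I toggled allocation (the host and port are just my local example):

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# ...and then back on again:
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'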

I have restarted the affected node a couple of times - each time it gets stuck on the same 23 shards.

Running ES 5.5.2 on both nodes.

Any clues?

Running just two nodes is a bad idea, see https://www.elastic.co/guide/en/elasticsearch/guide/2.x/important-configuration-changes.html#_minimum_master_nodes
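Once you have three master-eligible nodes you'll want discovery.zen.minimum_master_nodes set to 2 (a quorum of three). Something like this should do it; the host and port are placeholders for your setup:

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 2
  }
}'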

What does _cat/pending_tasks show?
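Something like this, with the host adjusted for your setup:

curl -s 'localhost:9200/_cat/pending_tasks?v'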

No output from _cat/pending_tasks

What about _cat/allocation and _cat/shards?
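Same again, for example (same placeholder host):

curl -s 'localhost:9200/_cat/allocation?v'
curl -s 'localhost:9200/_cat/shards?v'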

Output was too big for the forums. It's here: https://gist.github.com/anonymous/d854ee8f9f6e396d29fd4cf3e303eb34

Thanks. It looks like you could reduce the shard count to 1-2 without too much worry; do that in the index template.

That aside, these are replicas, so they shouldn't be causing a major issue beyond not being allocated. Can you run an explain on one of the shards to see what it says? https://www.elastic.co/guide/en/elasticsearch/reference/5.6/cluster-allocation-explain.html
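For example, picking one of the indices from your gist (adjust the index, shard number, and host as needed):

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '
{
  "index": "logstash-api-pns-2017.38",
  "shard": 0,
  "primary": true
}'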

That looks more helpful:

{
  "index" : "logstash-api-pns-2017.38",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "AKGKzJ2mQqa-26zbPlEsFw",
    "name" : "elasticsearch-02",
    "transport_address" : "10.20.4.145:9300",
    "weight_ranking" : 2
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "no",
  "can_rebalance_cluster_decisions" : [
    {
      "decider" : "rebalance_only_when_active",
      "decision" : "NO",
      "explanation" : "rebalancing is not allowed until all replicas in the cluster are active"
    }
  ],
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "rebalancing is not allowed, even though there is at least one node on which the shard can be allocated",
  "node_allocation_decisions" : [
    {
      "node_id" : "EDv6io2GRM6d5WfaA0EKwA",
      "node_name" : "elasticsearch-01",
      "transport_address" : "10.20.4.143:9300",
      "node_decision" : "yes",
      "weight_ranking" : 1
    }
  ]
}

Also, thanks for the advice regarding 2 nodes. I've made another non-data node master-eligible, which brings us to 3 master-eligible nodes.

Not sure what's happening, to be honest.
Can you drop the replicas for logstash-api-pns-2017.38 and re-add them to see if it completes?
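Something along these lines should do it (host is a placeholder):

curl -XPUT 'localhost:9200/logstash-api-pns-2017.38/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 0 } }'

# then, once the cluster has settled, add the replica back
curl -XPUT 'localhost:9200/logstash-api-pns-2017.38/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 1 } }'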

Ah! I changed number_of_replicas to 0 on one of the much smaller indexes, then back to 1, and those shards are being allocated now. I will do the same for all affected indexes and see if it has the same effect.

I need to upgrade the RAM in the second node at some point so we can see if this happens again on that.

I now have this error on the last shard:

org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-api-pns-2017.38][0]: Recovery failed from {elasticsearch-02}{AKGKzJ2mQqa-26zbPlEsFw}{1RXza1r5Q8Syxx924B_6cQ}{10.20.4.145}{10.20.4.145:9300} into {elasticsearch-01}{EDv6io2GRM6d5WfaA0EKwA}{o-ljKzafTMmM4qDqYxYELw}{10.20.4.143}{10.20.4.143:9300}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:314) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:73) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:556) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.2.jar:5.5.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-02][10.20.4.145:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.2.jar:5.5.2]
        ... 5 more
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.2.jar:5.5.2]
        ... 5 more
Caused by: java.lang.IllegalStateException: try to recover [logstash-api-pns-2017.38][0] from primary shard with sync id but number of docs differ: 123810358 (elasticsearch-02, primary) vs 123810384(elasticsearch-01)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:226) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.2.jar:5.5.2]
        ... 5 more

I see this referenced at https://github.com/elastic/elasticsearch/issues/12661, so I will set replicas to 0, leave it overnight to sort itself out, and then try setting it back to 1.
