Shard reallocation stops


(Phil Lavin) #1

Very confused here. I just upgraded the RAM in one node of a 2-node cluster. The node was shut down cleanly and brought back up cleanly after the upgrade. When the node starts, shards begin reallocating from the other node; however, it gets down to 23 shards remaining and stops. There's nothing I can see that is special about the shards that don't reallocate - they vary in size from about 1MB to 100GB. I have tried disabling and re-enabling cluster.routing.allocation.enable. There's nothing in the Elasticsearch logs on either of the two nodes.
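For reference, the allocation toggle was done through the cluster settings API, roughly like this (host/port are placeholders):

# disable shard allocation, then re-enable it
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "transient": { "cluster.routing.allocation.enable": "none" } }'

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "transient": { "cluster.routing.allocation.enable": "all" } }'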

I have restarted the affected node a couple of times - each time it gets stuck on the same 23 shards.

Running ES 5.5.2 on both nodes.

Any clues?


(Mark Walkom) #2

Two nodes is a bad setup; see https://www.elastic.co/guide/en/elasticsearch/guide/2.x/important-configuration-changes.html#_minimum_master_nodes

What does _cat/pending_tasks show?
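That check is just (host is a placeholder):

curl -XGET 'localhost:9200/_cat/pending_tasks?v'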


(Phil Lavin) #3

No output from _cat/pending_tasks


(Mark Walkom) #4

What about _cat/allocation and _cat/shards?
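Roughly these, for reference (host is a placeholder):

# per-node disk usage and shard counts
curl -XGET 'localhost:9200/_cat/allocation?v'

# every shard with its state (STARTED, UNASSIGNED, ...) and assigned node
curl -XGET 'localhost:9200/_cat/shards?v'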


(Phil Lavin) #5

Output was too big for the forums. It's here: https://gist.github.com/anonymous/d854ee8f9f6e396d29fd4cf3e303eb34


(Mark Walkom) #6

Thanks. Looks like you could reduce the shard count to 1-2 without too much worry; do that in the index template.
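For a 5.x index template that would look roughly like this - the template name, pattern and order here are illustrative, and you may already have a Logstash template to edit instead. It only affects indices created after the change:

curl -XPUT 'localhost:9200/_template/logstash-shards' -H 'Content-Type: application/json' -d '
{
  "template": "logstash-*",
  "order": 1,
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'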

That aside, these are replicas, so they shouldn't be causing a major issue other than not being allocated. Can you run an explain on a shard to see what it says? https://www.elastic.co/guide/en/elasticsearch/reference/5.6/cluster-allocation-explain.html
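The request is roughly this - with no body it explains an arbitrary unassigned shard, or you can name a specific index and shard from the _cat/shards output (the index name below is a placeholder):

curl -XGET 'localhost:9200/_cluster/allocation/explain' -H 'Content-Type: application/json' -d '
{ "index": "INDEX_NAME", "shard": 0, "primary": false }'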


(Phil Lavin) #7

That looks more helpful:

{
  "index" : "logstash-api-pns-2017.38",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "AKGKzJ2mQqa-26zbPlEsFw",
    "name" : "elasticsearch-02",
    "transport_address" : "10.20.4.145:9300",
    "weight_ranking" : 2
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "no",
  "can_rebalance_cluster_decisions" : [
    {
      "decider" : "rebalance_only_when_active",
      "decision" : "NO",
      "explanation" : "rebalancing is not allowed until all replicas in the cluster are active"
    }
  ],
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "rebalancing is not allowed, even though there is at least one node on which the shard can be allocated",
  "node_allocation_decisions" : [
    {
      "node_id" : "EDv6io2GRM6d5WfaA0EKwA",
      "node_name" : "elasticsearch-01",
      "transport_address" : "10.20.4.143:9300",
      "node_decision" : "yes",
      "weight_ranking" : 1
    }
  ]
}

(Phil Lavin) #8

Also, thanks for the advice regarding 2 nodes. I've made another non-data node master-eligible, which brings us to 3 master-eligible nodes.
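For anyone else doing the same, the relevant settings are roughly these (a 5.x sketch; adjust to your own config):

# elasticsearch.yml on the extra, non-data node
node.master: true
node.data: false

# on all three master-eligible nodes: quorum of 3 master-eligible nodes is 2
discovery.zen.minimum_master_nodes: 2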


(Mark Walkom) #9

Not sure what's happening, to be honest.
Can you drop the replica count for logstash-api-pns-2017.38 to 0 and then add it back, to see if it completes?


(Phil Lavin) #10

Ah! I changed number_of_replicas to 0 on one of the much smaller indexes, then back to 1, and those shards are distributing now. I will do this for all affected indexes and see if it has the same effect.
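The change itself is just an index settings update, along these lines (index name is an example):

curl -XPUT 'localhost:9200/logstash-api-pns-2017.38/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 0 } }'

# once the shards have settled, restore the replica
curl -XPUT 'localhost:9200/logstash-api-pns-2017.38/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 1 } }'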

I need to upgrade the RAM in the second node at some point, so we'll see whether this happens again then.


(Phil Lavin) #11

I now have this error on the last shard:

org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-api-pns-2017.38][0]: Recovery failed from {elasticsearch-02}{AKGKzJ2mQqa-26zbPlEsFw}{1RXza1r5Q8Syxx924B_6cQ}{10.20.4.145}{10.20.4.145:9300} into {elasticsearch-01}{EDv6io2GRM6d5WfaA0EKwA}{o-ljKzafTMmM4qDqYxYELw}{10.20.4.143}{10.20.4.143:9300}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:314) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:73) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:556) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.2.jar:5.5.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-02][10.20.4.145:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.2.jar:5.5.2]
        ... 5 more
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.2.jar:5.5.2]
        ... 5 more
Caused by: java.lang.IllegalStateException: try to recover [logstash-api-pns-2017.38][0] from primary shard with sync id but number of docs differ: 123810358 (elasticsearch-02, primary) vs 123810384(elasticsearch-01)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:226) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.2.jar:5.5.2]
        ... 5 more

I see this referenced at https://github.com/elastic/elasticsearch/issues/12661, so I will set replicas to 0, leave it overnight to sort itself out, and then try setting it back to 1.


(system) #12

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.