Shard reallocation stops


(Phil Lavin) #1

Very confused here. I just upgraded the RAM in one node of a 2-node cluster. The node was shut down cleanly and brought back up cleanly after the upgrade. When the node starts, shards begin reallocating from the other node; however, it gets down to 23 shards remaining and stops. There's nothing I can see that is special about the shards that don't reallocate - they vary in size from about 1MB to 100GB. I have tried disabling and re-enabling cluster.routing.allocation.enable. There's nothing in the Elasticsearch logs on either of the two nodes.
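For reference, the allocation toggle was done through the cluster settings API, roughly like this (host/port are placeholders):

# disable shard allocation, then re-enable it
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "transient": { "cluster.routing.allocation.enable": "none" } }'

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "transient": { "cluster.routing.allocation.enable": "all" } }'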

I have restarted the affected node a couple of times - each time it gets stuck on the same 23 shards.

Running ES 5.5.2 on both nodes.

Any clues?


(Mark Walkom) #2

Two nodes is a bad setup; see https://www.elastic.co/guide/en/elasticsearch/guide/2.x/important-configuration-changes.html#_minimum_master_nodes

What does _cat/pending_tasks show?
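That check is just (host is a placeholder):

curl -XGET 'localhost:9200/_cat/pending_tasks?v'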


(Phil Lavin) #3

No output from _cat/pending_tasks


(Mark Walkom) #4

What about _cat/allocation and _cat/shards?
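Roughly these, for reference (host is a placeholder):

# per-node disk usage and shard counts
curl -XGET 'localhost:9200/_cat/allocation?v'

# every shard with its state (STARTED, UNASSIGNED, ...) and assigned node
curl -XGET 'localhost:9200/_cat/shards?v'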


(Phil Lavin) #5

Output was too big for the forums. It's here: https://gist.github.com/anonymous/d854ee8f9f6e396d29fd4cf3e303eb34


(Mark Walkom) #6

Thanks. Looks like you could reduce the shard count to 1-2 without too much worry; do that in the index template.
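For a 5.x index template that would look roughly like this - the template name, pattern and order here are illustrative, and you may already have a Logstash template to edit instead. It only affects indices created after the change:

curl -XPUT 'localhost:9200/_template/logstash-shards' -H 'Content-Type: application/json' -d '
{
  "template": "logstash-*",
  "order": 1,
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'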

That aside, these are replicas, so they shouldn't be causing a major issue other than not being allocated. Can you run an explain on a shard to see what it says? https://www.elastic.co/guide/en/elasticsearch/reference/5.6/cluster-allocation-explain.html
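The request is roughly this - with no body it explains an arbitrary unassigned shard, or you can name a specific index and shard from the _cat/shards output (the index name below is a placeholder):

curl -XGET 'localhost:9200/_cluster/allocation/explain' -H 'Content-Type: application/json' -d '
{ "index": "INDEX_NAME", "shard": 0, "primary": false }'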


(Phil Lavin) #7

That looks more helpful:

{
  "index" : "logstash-api-pns-2017.38",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "AKGKzJ2mQqa-26zbPlEsFw",
    "name" : "elasticsearch-02",
    "transport_address" : "10.20.4.145:9300",
    "weight_ranking" : 2
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "no",
  "can_rebalance_cluster_decisions" : [
    {
      "decider" : "rebalance_only_when_active",
      "decision" : "NO",
      "explanation" : "rebalancing is not allowed until all replicas in the cluster are active"
    }
  ],
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "rebalancing is not allowed, even though there is at least one node on which the shard can be allocated",
  "node_allocation_decisions" : [
    {
      "node_id" : "EDv6io2GRM6d5WfaA0EKwA",
      "node_name" : "elasticsearch-01",
      "transport_address" : "10.20.4.143:9300",
      "node_decision" : "yes",
      "weight_ranking" : 1
    }
  ]
}

(Phil Lavin) #8

Also, thanks for the advice regarding 2 nodes. I've made another non-data node master-eligible, which brings us to 3 master-eligible nodes.
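For anyone else doing the same, the relevant settings are roughly these (a 5.x sketch; adjust to your own config):

# elasticsearch.yml on the extra, non-data node
node.master: true
node.data: false

# on all three master-eligible nodes: quorum of 3 master-eligible nodes is 2
discovery.zen.minimum_master_nodes: 2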


(Mark Walkom) #9

Not sure what's happening, to be honest.
Can you drop the replica count for logstash-api-pns-2017.38 to 0 and then add it back, to see if it completes?


(Phil Lavin) #10

Ah! I changed number_of_replicas to 0 on one of the much smaller indexes, then back to 1, and those shards are distributing now. I will do this for all affected indexes and see if it has the same effect.
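The change itself is just an index settings update, along these lines (index name is an example):

curl -XPUT 'localhost:9200/logstash-api-pns-2017.38/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 0 } }'

# once the shards have settled, restore the replica
curl -XPUT 'localhost:9200/logstash-api-pns-2017.38/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 1 } }'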

I need to upgrade the RAM in the second node at some point, so we'll see whether this happens again then.


(Phil Lavin) #11

I now have this error on the last shard:

org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-api-pns-2017.38][0]: Recovery failed from {elasticsearch-02}{AKGKzJ2mQqa-26zbPlEsFw}{1RXza1r5Q8Syxx924B_6cQ}{10.20.4.145}{10.20.4.145:9300} into {elasticsearch-01}{EDv6io2GRM6d5WfaA0EKwA}{o-ljKzafTMmM4qDqYxYELw}{10.20.4.143}{10.20.4.143:9300}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:314) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:73) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:556) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.2.jar:5.5.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-02][10.20.4.145:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.2.jar:5.5.2]
        ... 5 more
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.2.jar:5.5.2]
        ... 5 more
Caused by: java.lang.IllegalStateException: try to recover [logstash-api-pns-2017.38][0] from primary shard with sync id but number of docs differ: 123810358 (elasticsearch-02, primary) vs 123810384(elasticsearch-01)
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:226) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) ~[elasticsearch-5.5.2.jar:5.5.2]
        ... 5 more

I see this referenced at https://github.com/elastic/elasticsearch/issues/12661, so I will set replicas to 0, leave it overnight to sort itself out, and then try setting it back to 1.


(system) #12

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.