Replica shard stuck in UNASSIGNED state

Hi, we are using ES 5.2 on a 4-node cluster. Some replica shards have been stuck in the UNASSIGNED state for the last couple of hours. Below is the information about these shards from the _cluster/state API.
Also, a lot of bulk write operations are timing out, and the _cat/shards API hangs without returning a response when called from curl. Can someone please take a look? How can this be debugged further? Thanks in advance.

"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 3,
"index" : "cfileindex",
"recovery_source" : {
  "type" : "PEER"
},
"unassigned_info" : {
  "reason" : "ALLOCATION_FAILED",
  "at" : "2017-07-15T23:23:28.426Z",
  "failed_attempts" : 5,
  "delayed" : false,
  "details" : "master {130593345248}{4diFAMWcTL6N214ezq8yXA}{8nfDzbB_QMKZG41fd_-S6Q}{10.2.34.149}{10.2.34.149:25800} has not removed previously failed shard. resending shard failure",
  "allocation_status" : "no_attempt"
}




    {
      "state" : "UNASSIGNED",
      "primary" : false,
      "node" : null,
      "relocating_node" : null,
      "shard" : 8,
      "index" : "cfileindex",
      "recovery_source" : {
        "type" : "PEER"
      },
      "unassigned_info" : {
        "reason" : "ALLOCATION_FAILED",
        "at" : "2017-07-16T11:05:53.747Z",
        "failed_attempts" : 5,
        "delayed" : false,
        "details" : "failed recovery, failure RecoveryFailedException[[cfileindex][8]: Recovery failed from {130593347324}{JAvDtnPwSXuNl7AYLjbgsw}{7loDHDZJQta-Ws1ZmM4WBA}{10.2.34.165}{10.2.34.165:25800} into {130593342308}{TxkUgGT_QrmyaoM6x5U__g}{rnaZ_IV_TR-6ns48jskCjw}{10.2.34.155}{10.2.34.155:25800}]; nested: RemoteTransportException[[130593347324][10.2.34.165:25800][internal:index/shard/recovery/start_recovery]]; nested: ReceiveTimeoutTransportException[[130593342308][10.2.34.155:25800][internal:index/shard/recovery/finalize] request_id [178425] timed out after [1800000ms]]; ",
        "allocation_status" : "no_attempt"
      }
    }

You might have had a network hiccup that caused a shard to be unassigned and then quickly reassigned to the same node. The error you got suggests that, by the time of reassignment, the node had not yet cleared its old shard copy (we had some bugs in this area, so I suggest you upgrade to 5.5). The master tried the allocation 5 times and then gave up in order to avoid poisonous situations and flooding the logs. You can retry allocating the shard using POST /_cluster/reroute?retry_failed - see https://www.elastic.co/guide/en/elasticsearch/reference/5.5/cluster-reroute.html#_retry_failed_shards
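
For anyone who wants to script this, here is a rough Python sketch of the retry step, not an official client. It assumes the shard entries from _cluster/state have already been loaded as dicts, and that the HTTP endpoint is reachable at a base URL such as http://localhost:9200 (an assumption; the thread only shows the transport port). The helper names are made up for illustration. The threshold of 5 matches the default of index.allocation.max_retries.

```python
import urllib.request

# Default value of index.allocation.max_retries; adjust if the index overrides it.
MAX_RETRIES = 5


def exhausted_shards(shards, max_retries=MAX_RETRIES):
    """Return (index, shard) pairs for unassigned copies that used up their retries.

    `shards` is a list of shard routing entries shaped like the ones
    pasted from the _cluster/state output above.
    """
    result = []
    for entry in shards:
        info = entry.get("unassigned_info") or {}
        if (entry.get("state") == "UNASSIGNED"
                and info.get("failed_attempts", 0) >= max_retries):
            result.append((entry["index"], entry["shard"]))
    return result


def build_retry_request(base_url):
    """Build (but do not send) the POST /_cluster/reroute?retry_failed request.

    `base_url` is hypothetical, e.g. "http://localhost:9200".
    """
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed"
    return urllib.request.Request(url, method="POST")


def send(request, timeout=30):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A typical use would be to call exhausted_shards() on the routing entries first, and only call send(build_retry_request(...)) if it returns something. If the underlying recovery timeout is still happening, the retry will most likely fail again, so check the logs on the source and target nodes as well.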

