UNASSIGNED NODE_LEFT

We have a cluster of 30 nodes: 3 physical servers (64 CPUs + 512 GB memory each), with 10 ES instances on each server (1 master + 9 data).
Sometimes, after uploading a new index (~700 GB, 24 shards * 2 replication factor), 3-4 instances crash and we have to start them manually. After that, the index has some replica shards in the UNASSIGNED state with reason NODE_LEFT and "allocate_explanation" : "Elasticsearch is retrieving information about this shard from one or more nodes. It will make an allocation decision after it receives this information. Please wait.".
How can I find out what information the cluster is waiting for, and from which nodes?
The API call POST _cluster/reroute?retry_failed=true didn't help us.
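
For reference, the "allocate_explanation" text comes from the cluster allocation explain API, which can be queried per shard copy. Below is a minimal sketch, not a definitive tool: it assumes the text output of GET _cat/shards?h=index,shard,prirep,state,unassigned.reason has been saved, and it builds one explain request body for each UNASSIGNED copy. The index name my-index and the sample lines are placeholders, and the helper name is hypothetical.

```python
import json


def unassigned_explain_bodies(cat_shards_text):
    """Build one allocation-explain request body per UNASSIGNED shard copy.

    Expects whitespace-separated columns in this order:
    index, shard, prirep, state, unassigned.reason (the last may be empty).
    """
    bodies = []
    for line in cat_shards_text.strip().splitlines():
        fields = line.split()
        # Skip malformed lines and every shard copy that is already assigned.
        if len(fields) < 4 or fields[3] != "UNASSIGNED":
            continue
        index, shard, prirep = fields[0], fields[1], fields[2]
        bodies.append(
            {"index": index, "shard": int(shard), "primary": prirep == "p"}
        )
    return bodies


# Placeholder sample; real input would come from the _cat/shards call above.
sample = """\
my-index 0 p STARTED
my-index 0 r UNASSIGNED NODE_LEFT
my-index 1 r UNASSIGNED NODE_LEFT
"""

for body in unassigned_explain_bodies(sample):
    # Each body is meant to be sent with this request line.
    print("GET _cluster/allocation/explain")
    print(json.dumps(body))
```

Each printed body can be sent as the request body of GET _cluster/allocation/explain to see the allocation status of that specific replica.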

PS: We have enabled shard allocation awareness.
In the master nodes' config:

cluster.routing.allocation.awareness.attributes: rack_id

In the data nodes' config:

node.attr.rack_id: <phys server hostname>

Persistent cluster settings (force awareness + same_shard):

{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "awareness": {
            "force": {
              "rack_id": {
                "values": "srv1_name,srv2_name,srv3_name"
              }
            }
          },
          "same_shard": {
            "host": "true"
          }
        }
      }
    }
  },
  "transient": {}
}
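
To make the nested settings easier to compare with the dotted names used in elasticsearch.yml, here is a small sketch that flattens the "persistent" object above into dotted keys. The helper name flatten is hypothetical, and the server names are the same placeholders as in the settings.

```python
def flatten(settings, prefix=""):
    """Turn nested settings into a flat dict keyed by dotted setting names."""
    flat = {}
    for key, value in settings.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# The "persistent" part of the cluster settings shown above.
persistent = {
    "cluster": {
        "routing": {
            "allocation": {
                "awareness": {
                    "force": {
                        "rack_id": {"values": "srv1_name,srv2_name,srv3_name"}
                    }
                },
                "same_shard": {"host": "true"},
            }
        }
    }
}

for name, value in flatten(persistent).items():
    print(f"{name}: {value}")
```

This prints two settings: cluster.routing.allocation.awareness.force.rack_id.values and cluster.routing.allocation.same_shard.host.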
