We have a cluster of 30 nodes: 3 physical servers (64 CPU + 512 mem each), with 10 ES instances on each server (1 master + 9 data).
Sometimes, after uploading a new index (~700 GB, 24 shards * 2 rep factor), 3-4 instances crash and we have to start them manually. After that, some replica shards of the index stay in the UNASSIGNED state with reason NODE_LEFT and "allocate_explanation" : "Elasticsearch is retrieving information about this shard from one or more nodes. It will make an allocation decision after it receives this information. Please wait.".
How can I find out what information the cluster is waiting for, and from which nodes?
The API call POST _cluster/reroute?retry_failed=true
did not help.
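For reference, here is a minimal sketch of how the stuck shards could be turned into request bodies for GET _cluster/allocation/explain, one per unassigned shard. It assumes the rows come from GET _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason; the index name `my_index` and the sample rows are hypothetical placeholders, not output from our cluster, and the helper name `explain_bodies` is made up for illustration.

```python
import json


def explain_bodies(cat_shards_rows):
    """Build one allocation-explain request body per UNASSIGNED shard row."""
    bodies = []
    for row in cat_shards_rows:
        if row.get("state") != "UNASSIGNED":
            continue  # only unassigned shards are of interest here
        bodies.append({
            "index": row["index"],
            "shard": int(row["shard"]),        # _cat APIs return numbers as strings
            "primary": row["prirep"] == "p",   # "p" = primary, "r" = replica
        })
    return bodies


# Hypothetical rows in the shape returned by _cat/shards?format=json
sample_rows = [
    {"index": "my_index", "shard": "3", "prirep": "r",
     "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "my_index", "shard": "3", "prirep": "p",
     "state": "STARTED", "unassigned.reason": None},
]

# Each printed line is a body to send to GET _cluster/allocation/explain
for body in explain_bodies(sample_rows):
    print(json.dumps(body))
```

Each body can then be sent to the explain endpoint with any HTTP client; the response for a given replica is where the "allocate_explanation" text quoted above comes from.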
PS: we have routing awareness enabled.
In the master nodes' config:
cluster.routing.allocation.awareness.attributes: rack_id
In the data nodes' config:
node.attr.rack_id: <phys server hostname>
Persistent cluster settings (force awareness + same_shard):
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "awareness": {
            "force": {
              "rack_id": {
                "values": "srv1_name,srv2_name,srv3_name"
              }
            }
          },
          "same_shard": {
            "host": "true"
          }
        }
      }
    }
  },
  "transient": {}
}
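To make the nested settings easier to compare with the dotted names used in elasticsearch.yml, here is a small sketch that flattens the persistent block above into dotted keys. The `flatten` helper is my own illustration, not part of any Elasticsearch client, and the server names are the same placeholders as in the settings.

```python
def flatten(settings, prefix=""):
    """Flatten nested settings into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    flat = {}
    for key, value in settings.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # descend into nested objects
        else:
            flat[path] = value
    return flat


# The "persistent" part of the cluster settings shown above
persistent = {
    "cluster": {
        "routing": {
            "allocation": {
                "awareness": {
                    "force": {
                        "rack_id": {"values": "srv1_name,srv2_name,srv3_name"}
                    }
                },
                "same_shard": {"host": "true"},
            }
        }
    }
}

for name, value in sorted(flatten(persistent).items()):
    print(f"{name} = {value}")
```

This yields the two settings in effect: cluster.routing.allocation.awareness.force.rack_id.values and cluster.routing.allocation.same_shard.host.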