I have set up a cluster of 20 nodes with several indices and shards. Each shard has one or two replicas.
The cluster is quiet, in the sense that it doesn't receive any queries.
In this state, I start killing nodes at random intervals and restart them immediately. After a few minutes, I stop the kills and wait for all nodes to come back up.
Some shards remain in the "red" state, and the cluster health reports many unassigned shards.
I have some questions:
- What can I do in this case to regain the lost shards?
- What can I do to avoid losing shards?
_cluster/reroute?explain returns this for an unavailable shard:
"messages_12" : {
  "shards" : {
    "0" : [
      {
        "state" : "UNASSIGNED",
        "primary" : true,
        "node" : null,
        "relocating_node" : null,
        "shard" : 0,
        "index" : "messages_12",
        "recovery_source" : {
          "type" : "EXISTING_STORE"
        },
        "unassigned_info" : {
          "reason" : "CLUSTER_RECOVERED",
          "at" : "2016-10-11T11:28:07.442Z",
          "delayed" : false,
          "allocation_status" : "no_valid_shard_copy"
        }
      },
      {
        "state" : "UNASSIGNED",
        "primary" : false,
        "node" : null,
        "relocating_node" : null,
        "shard" : 0,
        "index" : "messages_12",
        "recovery_source" : {
          "type" : "PEER"
        },
        "unassigned_info" : {
          "reason" : "CLUSTER_RECOVERED",
          "at" : "2016-10-11T11:28:07.442Z",
          "delayed" : false,
          "allocation_status" : "no_attempt"
        }
      },
      {
        "state" : "UNASSIGNED",
        "primary" : false,
        "node" : null,
        "relocating_node" : null,
        "shard" : 0,
        "index" : "messages_12",
        "recovery_source" : {
          "type" : "PEER"
        },
        "unassigned_info" : {
          "reason" : "CLUSTER_RECOVERED",
          "at" : "2016-10-11T11:28:07.442Z",
          "delayed" : false,
          "allocation_status" : "no_attempt"
        }
      }
    ]
  }
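
To see at a glance which shards are affected, output of this shape could be filtered with a short script. What follows is only a rough sketch, assuming the response has been saved and parsed into a dict keyed by index name, as in the fragment above. The function names and the trimmed SAMPLE are made up for illustration and are not part of any Elasticsearch client. It only reads the fields shown above and does not talk to the cluster.

```python
import json

# Trimmed sample in the same shape as the fragment above.
# Fields that the script does not read have been left out.
SAMPLE = """
{
  "messages_12": {
    "shards": {
      "0": [
        {"state": "UNASSIGNED", "primary": true, "shard": 0,
         "unassigned_info": {"reason": "CLUSTER_RECOVERED",
                             "allocation_status": "no_valid_shard_copy"}},
        {"state": "UNASSIGNED", "primary": false, "shard": 0,
         "unassigned_info": {"reason": "CLUSTER_RECOVERED",
                             "allocation_status": "no_attempt"}},
        {"state": "UNASSIGNED", "primary": false, "shard": 0,
         "unassigned_info": {"reason": "CLUSTER_RECOVERED",
                             "allocation_status": "no_attempt"}}
      ]
    }
  }
}
"""


def unassigned_copies(routing):
    """Yield (index, shard, is_primary, allocation_status) for each UNASSIGNED copy."""
    for index_name, index_entry in routing.items():
        for shard_id, copies in index_entry.get("shards", {}).items():
            for copy in copies:
                if copy.get("state") != "UNASSIGNED":
                    continue
                info = copy.get("unassigned_info", {})
                yield (
                    index_name,
                    int(shard_id),
                    bool(copy.get("primary", False)),
                    info.get("allocation_status"),
                )


def primaries_without_valid_copy(routing):
    """Return (index, shard) pairs whose primary reports "no_valid_shard_copy"."""
    return [
        (index_name, shard)
        for index_name, shard, is_primary, status in unassigned_copies(routing)
        if is_primary and status == "no_valid_shard_copy"
    ]


routing = json.loads(SAMPLE)
for index_name, shard in primaries_without_valid_copy(routing):
    print(f"{index_name} shard {shard}: primary has no valid shard copy")
```

The replicas with "no_attempt" are listed by unassigned_copies but left out of the second function, since in the output above only the primary carries the "no_valid_shard_copy" status.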