Why do I lose data?

I have set up a cluster of 20 nodes with several indexes and shards. Each shard has one or two replicas configured.
The cluster is quiet, in the sense that it doesn't receive any queries.

In this state, I start killing nodes at random intervals and immediately restarting them. After a few minutes, I stop the kills and wait for all nodes to come back up.
Some shards remain in a "red" state, and the cluster health reports many unassigned shards.
I have some questions:

  • what can I do in this case to regain the lost shards?
  • what can I do to avoid losing shards in the first place?

_cluster/reroute?explain returns this for an unavailable shard:
"messages_12" : {
"shards" : {
"0" : [
{
"state" : "UNASSIGNED",
"primary" : true,
"node" : null,
"relocating_node" : null,
"shard" : 0,
"index" : "messages_12",
"recovery_source" : {
"type" : "EXISTING_STORE"
},
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-10-11T11:28:07.442Z",
"delayed" : false,
"allocation_status" : "no_valid_shard_copy"
}
},
{
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 0,
"index" : "messages_12",
"recovery_source" : {
"type" : "PEER"
},
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-10-11T11:28:07.442Z",
"delayed" : false,
"allocation_status" : "no_attempt"
}
},
{
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 0,
"index" : "messages_12",
"recovery_source" : {
"type" : "PEER"
},
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-10-11T11:28:07.442Z",
"delayed" : false,
"allocation_status" : "no_attempt"
}
}
]
}
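
For a primary stuck with allocation_status "no_valid_shard_copy", the usual options are to check which nodes still hold an on-disk copy and, if one exists, force-allocate it. The following is only a sketch: the node name "node-3" is a placeholder, and allocate_stale_primary accepts that the chosen copy may be missing recent writes.

# List which nodes still hold an on-disk copy of the red shards
curl -XGET 'http://localhost:9200/_shard_stores?status=red&pretty'

# Force-allocate the stale primary on a node that still has a copy
# ("node-3" is hypothetical; accept_data_loss acknowledges possible data loss)
curl -XPOST 'http://localhost:9200/_cluster/reroute?pretty' -d '{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : "messages_12",
        "shard" : 0,
        "node" : "node-3",
        "accept_data_loss" : true
      }
    }
  ]
}'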

This is 5.0.0-beta or rc1, right? Can you please share the output of

curl -XGET 'http://localhost:9200/_shard_stores?pretty'

and

curl -XGET 'http://localhost:9200/_cluster/state?pretty'

so I can have a closer look at the state the cluster ended up in.

Also, have you correctly configured the discovery.zen.minimum_master_nodes setting?
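For reference, if all 20 nodes are master-eligible, the quorum would be (20 / 2) + 1 = 11; a minimal elasticsearch.yml sketch (assuming that layout, not your actual config) would be:

# elasticsearch.yml sketch: all 20 nodes master-eligible, quorum = (20 / 2) + 1
discovery.zen.minimum_master_nodes: 11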

Yes, beta1. And indeed, a split brain could have occurred due to a mistake on my side: minimum_master_nodes was set too low for a quorum, and every data node was also master-eligible because of a wrong configuration entry.
Sorry for the noise; I re-ran the test with the correct setting and could not reproduce this behaviour.
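
For anyone hitting the same symptom, the fix amounts to something like the following sketch (hypothetical layout, not the exact config used here): a small set of dedicated master-eligible nodes, with minimum_master_nodes derived from that set only.

# On 3 dedicated master-eligible nodes (hypothetical):
node.master: true
node.data: false
discovery.zen.minimum_master_nodes: 2   # (3 / 2) + 1

# On the remaining data-only nodes:
node.master: false
node.data: true
discovery.zen.minimum_master_nodes: 2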