Why do I lose data?

I have set up a cluster of 20 nodes with several indexes and shards. Each shard has one or two replicas configured.
The cluster is quiet, in the sense that it doesn't receive any queries.

In this state, I start killing nodes at random intervals and immediately restarting them. After a few minutes, I stop the kills and wait for all nodes to come back up.
Some shards remain in a "red" state, and the cluster health reports many unassigned shards.
I have some questions:

  • what can I do in this case to regain the lost shards?
  • what can I do to avoid losing shards in the first place?

_cluster/reroute?explain returns this for an unavailable shard:
"messages_12" : {
"shards" : {
"0" : [
{
"state" : "UNASSIGNED",
"primary" : true,
"node" : null,
"relocating_node" : null,
"shard" : 0,
"index" : "messages_12",
"recovery_source" : {
"type" : "EXISTING_STORE"
},
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-10-11T11:28:07.442Z",
"delayed" : false,
"allocation_status" : "no_valid_shard_copy"
}
},
{
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 0,
"index" : "messages_12",
"recovery_source" : {
"type" : "PEER"
},
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-10-11T11:28:07.442Z",
"delayed" : false,
"allocation_status" : "no_attempt"
}
},
{
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 0,
"index" : "messages_12",
"recovery_source" : {
"type" : "PEER"
},
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2016-10-11T11:28:07.442Z",
"delayed" : false,
"allocation_status" : "no_attempt"
}
}
]
}
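
For a primary stuck with allocation_status "no_valid_shard_copy", the usual options are to check which nodes still hold an on-disk copy and, if one exists, force-allocate it. The following is only a sketch: the node name "node-3" is a placeholder, and allocate_stale_primary accepts that the chosen copy may be missing recent writes.

# List which nodes still hold an on-disk copy of the red shards
curl -XGET 'http://localhost:9200/_shard_stores?status=red&pretty'

# Force-allocate the stale primary on a node that still has a copy
# ("node-3" is hypothetical; accept_data_loss acknowledges possible data loss)
curl -XPOST 'http://localhost:9200/_cluster/reroute?pretty' -d '{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : "messages_12",
        "shard" : 0,
        "node" : "node-3",
        "accept_data_loss" : true
      }
    }
  ]
}'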

This is 5.0.0-beta or rc1, right? Can you please share the output of

curl -XGET 'http://localhost:9200/_shard_stores?pretty'

and

curl -XGET 'http://localhost:9200/_cluster/state?pretty'

so I can have a closer look at the state the cluster ended up in.

Also, have you correctly configured the discovery.zen.minimum_master_nodes setting?
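For reference, if all 20 nodes are master-eligible, the quorum would be (20 / 2) + 1 = 11; a minimal elasticsearch.yml sketch (assuming that layout, not your actual config) would be:

# elasticsearch.yml sketch: all 20 nodes master-eligible, quorum = (20 / 2) + 1
discovery.zen.minimum_master_nodes: 11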

Yes, beta1. And indeed, a split brain could have occurred due to a mistake on my side: minimum_master_nodes was set too low for a quorum, and every data node was also master-eligible because of a wrong configuration entry.
Sorry for the noise; I re-ran the test with the correct setting and could not reproduce this behaviour.
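
For anyone hitting the same symptom, the fix amounts to something like the following sketch (hypothetical layout, not the exact config used here): a small set of dedicated master-eligible nodes, with minimum_master_nodes derived from that set only.

# On 3 dedicated master-eligible nodes (hypothetical):
node.master: true
node.data: false
discovery.zen.minimum_master_nodes: 2   # (3 / 2) + 1

# On the remaining data-only nodes:
node.master: false
node.data: true
discovery.zen.minimum_master_nodes: 2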