ES Cluster Recovery and Restart

Greetings All,

I have a new 6-node ES 0.20.1 cluster. The cluster was up and running with
a small amount of data. There were a total of 302 shards (151 primary,
with 1 copy) distributed amongst 4 indices.

We shut down the cluster so that we could set the ES_HEAP_SIZE. After that
we started the cluster. Ever since then it has been in a red state with 30
unassigned shards. We tried tweaking the gateway settings (
http://www.elasticsearch.org/guide/reference/modules/gateway/) and
restarting but that didn't help. It has been a full day now with no change
in status.

My thoughts are to do the following:

  1. Grep the logs on all of the machines to collect all of the IndexShardMissingException
    error messages.
  2. Shut down the cluster.
  3. See if I can find good copies of the missing shards and, if so, copy
    those shard files to the appropriate index directory where the log error
    messages reported a problem.
  4. Start the cluster and see if it can recover.
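Step 1 of the plan might look something like the sketch below. The log path is an assumption (a real node would log somewhere like /var/log/elasticsearch or $ES_HOME/logs); a sample log file is created under /tmp here just so the pipeline is self-contained:

```shell
# Hypothetical sketch of step 1. The log directory is an assumption --
# /tmp/es-logs is used so the demo runs anywhere; on a real node, point
# grep at the actual Elasticsearch log directory on each machine.
mkdir -p /tmp/es-logs
cat > /tmp/es-logs/node1.log <<'EOF'
[2013-01-09 17:02:11,123][WARN ][cluster.action.shard] [node1] received shard failed for [idx][3], reason [IndexShardMissingException[[idx][3] missing]]
[2013-01-09 17:02:12,456][INFO ][cluster.service     ] [node1] routine message
EOF

# Collect every IndexShardMissingException line, prefixed with the file it
# came from, so the affected index/shard pairs can be listed per node:
grep -H 'IndexShardMissingException' /tmp/es-logs/*.log
```

Running the same grep on each machine (or over logs gathered with scp) gives the list of index/shard pairs to hunt for good copies of.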

Is this a viable plan? Fortunately this is a test cluster so it is not the
end of the world if I have to wipe it and start over. But I want to
understand proper error recovery so I know what to do if this happens in
production.

Also, are there any additional procedural steps I should follow in the
event that I have to restart a cluster so as to avoid this type of issue in
the future?

Many thanks for any information!

Best Regards,

--gordon

--

Hello Gordon,

Your plan sounds OK to me, but you have to take care of your recovery
settings. You may have already done this, but I'd put something like:

gateway:
  recover_after_nodes: 5
  recover_after_time: 5m
  expected_nodes: 6

This makes sure recovery only starts once enough nodes have joined to have
at least one copy of each shard available.

Also, if all your nodes are master-eligible, which is the default, you
might want to set discovery.zen.minimum_master_nodes to 4 (nodes/2 + 1), so
that you don't get a split-brain when you restart the whole cluster.
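In elasticsearch.yml that setting would look like the fragment below (4 follows from the 6 nodes in this particular cluster; recompute nodes/2 + 1 if the node count changes):

```yaml
# elasticsearch.yml: require a majority of master-eligible nodes (6/2 + 1 = 4)
# before a master can be elected, to avoid split-brain during a full restart
discovery.zen.minimum_master_nodes: 4
```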

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

On Wed, Jan 9, 2013 at 7:13 PM, Gordon Tillman gordyt@gmail.com wrote:


--

Radu thank you very much for the information. I appreciate your time and
trouble.

--gordon

On Thursday, January 10, 2013 5:56:02 AM UTC-6, Radu Gheorghe wrote:
