ES Cluster Recovery and Restart

Greetings All,

I have a new 6-node ES 0.20.1 cluster. The cluster was up and running with
a small amount of data. There were a total of 302 shards (151 primary,
with 1 copy) distributed amongst 4 indices.

We shut down the cluster so that we could set the ES_HEAP_SIZE. After that
we started the cluster. Ever since then it has been in a red state with 30
unassigned shards. We tried tweaking the gateway settings (
http://www.elasticsearch.org/guide/reference/modules/gateway/) and
restarting but that didn't help. It has been a full day now with no change
in status.

My thoughts are to do the following:

  1. Grep the logs on all of the machines to collect all of the IndexShardMissingException
    error messages.
  2. Shut down the cluster.
  3. See if I can find good copies of the missing shards and, if so, copy
    those shard files to the appropriate index directory where the log error
    messages reported a problem.
  4. Start the cluster and see if it can recover.
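Step 1 of the plan might look something like the sketch below. The log path is an assumption (a real node would log somewhere like /var/log/elasticsearch or $ES_HOME/logs); a sample log file is created under /tmp here just so the pipeline is self-contained:

```shell
# Hypothetical sketch of step 1. The log directory is an assumption --
# /tmp/es-logs is used so the demo runs anywhere; on a real node, point
# grep at the actual Elasticsearch log directory on each machine.
mkdir -p /tmp/es-logs
cat > /tmp/es-logs/node1.log <<'EOF'
[2013-01-09 17:02:11,123][WARN ][cluster.action.shard] [node1] received shard failed for [idx][3], reason [IndexShardMissingException[[idx][3] missing]]
[2013-01-09 17:02:12,456][INFO ][cluster.service     ] [node1] routine message
EOF

# Collect every IndexShardMissingException line, prefixed with the file it
# came from, so the affected index/shard pairs can be listed per node:
grep -H 'IndexShardMissingException' /tmp/es-logs/*.log
```

Running the same grep on each machine (or over logs gathered with scp) gives the list of index/shard pairs to hunt for good copies of.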

Is this a viable plan? Fortunately this is a test cluster so it is not the
end of the world if I have to wipe it and start over. But I want to
understand proper error recovery so I know what to do if this happens in
production.

Also, are there any additional procedural steps I should follow in the
event that I have to restart a cluster so as to avoid this type of issue in
the future?

Many thanks for any information!

Best Regards,

--gordon

--

Hello Gordon,

Your plan sounds OK to me, but you have to take care of your recovery
settings. You may have already done this, but I'd put something like:

gateway:
  recover_after_nodes: 5
  recover_after_time: 5m
  expected_nodes: 6

This makes sure recovery only starts once enough nodes have joined to have
at least one copy of each shard available.

Also, if all your nodes are master-eligible, which is the default, you
might want to set discovery.zen.minimum_master_nodes to 4 (nodes/2 + 1), so
that you don't get a split-brain when you restart the whole cluster.
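In elasticsearch.yml that setting would look like the fragment below (4 follows from the 6 nodes in this particular cluster; recompute nodes/2 + 1 if the node count changes):

```yaml
# elasticsearch.yml: require a majority of master-eligible nodes (6/2 + 1 = 4)
# before a master can be elected, to avoid split-brain during a full restart
discovery.zen.minimum_master_nodes: 4
```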

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

On Wed, Jan 9, 2013 at 7:13 PM, Gordon Tillman gordyt@gmail.com wrote:


--

Radu thank you very much for the information. I appreciate your time and
trouble.

--gordon

On Thursday, January 10, 2013 5:56:02 AM UTC-6, Radu Gheorghe wrote:
