Shard getting "stuck" during a rebalance


(Jamie Matthews) #1

Hi,

We have a problem with a node getting "stuck" during a rebalance. It's a
five-node cluster running ES 0.19.1 with a single large index split into
five shards, with a replication level of 2. One of the shards has been in
INITIALIZING state since yesterday, and the logs on that node show
org.elasticsearch.index.IndexShardMissingException.

Screenshot of the HEAD interface here: http://imgur.com/T7I4r - the "stuck"
shard is number 1 on the fifth node in the list.

My concern is that this shard now has only one "good" copy, and we could
lose the index if anything goes wrong with it (for various reasons it's
currently very difficult for us to reindex; this is being worked on).

Will just stopping and restarting that node fix the issue? Is this likely
to cause any further problems?

Many thanks,

Jamie

--


(Radu Gheorghe) #2

Hello Jamie,

On Mon, Oct 22, 2012 at 12:20 PM, Jamie Matthews
jamie.matthews@gmail.com wrote:

> Hi,
>
> We have a problem with a node getting "stuck" during a rebalance. It's a
> five-node cluster running ES 0.19.1 with a single large index split into
> five shards, with a replication level of 2. One of the shards has been in
> INITIALIZING state since yesterday, and the logs on that node show
> org.elasticsearch.index.IndexShardMissingException.

If you use the local gateway, which is the default, this issue is most
likely caused either by something being corrupted on the filesystem
or, if you recently restarted your cluster, by inappropriate recovery
settings. Those settings are described here:
http://www.elasticsearch.org/guide/reference/modules/gateway/local.html

With the default recovery settings and five nodes, if you restart your
whole cluster, all nodes might try to recover their shards from the
local gateway at the same time, which can lead to inconsistencies.
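
To make that concrete, here is a rough sketch of what the recovery
settings could look like in elasticsearch.yml for a five-node cluster.
The setting names are the gateway ones from the page linked above; the
values are only illustrative assumptions on my part, so please tune them
for your own setup rather than copying them as-is.

```yaml
# Illustrative values only (assumed, not taken from Jamie's cluster).
# Do not start recovery until at least 4 nodes have joined.
gateway.recover_after_nodes: 4
# Once that threshold is met, wait up to 5 minutes for the rest.
gateway.recover_after_time: 5m
# If all 5 expected nodes are present, start recovery right away.
gateway.expected_nodes: 5
```

The idea is simply that a full-cluster restart waits for most nodes to be
back before any of them starts recovering shards from its local gateway.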

> Screenshot of the HEAD interface here: http://imgur.com/T7I4r - the "stuck"
> shard is number 1 on the fifth node in the list.
>
> My concern is that this shard now has only one "good" copy, and we are
> concerned about losing the index if anything goes wrong with it (for various
> reasons it's currently very difficult for us to reindex, this is being
> worked on).

> Will just stopping and restarting that node fix the issue? Is this likely to
> cause any further problems?

AFAIK, restarting a node in your situation will cause the contents from
other nodes to be replicated to the restarted node. Since your other
nodes have "good" shards, the problem should get fixed.

The worst thing that could happen(tm) is that more, or all, shards on
that node become faulty after the restart. That isn't much different
from the situation you're in now: you would still have to find a
solution to what would basically be the same problem.
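
If it helps, a small script can show which shards are not fully allocated
before and after the restart. The sketch below is only a minimal example:
it assumes the cluster health API is called with level=shards and that
the response nests per-shard entries with a "status" field under
"indices"; the host, port and the two function names are made up for
illustration.

```python
import json
from urllib.request import urlopen


def find_unhealthy_shards(health):
    """Return sorted (index, shard_id, status) tuples for non-green shards.

    `health` is the parsed JSON body of a cluster health response that was
    requested with level=shards (response shape assumed, see note above).
    """
    problems = []
    for index_name, index_info in health.get("indices", {}).items():
        for shard_id, shard in index_info.get("shards", {}).items():
            status = shard.get("status")
            if status != "green":
                problems.append((index_name, int(shard_id), status))
    return sorted(problems)


def fetch_health(base_url="http://localhost:9200"):
    """Fetch shard-level cluster health (host and port are assumptions)."""
    with urlopen(base_url + "/_cluster/health?level=shards") as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling find_unhealthy_shards(fetch_health()) against one of your nodes
should list the stuck shard now, and ideally return an empty list once
the restarted node has finished recovering.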

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

--

