Non-recovering index shard


(Pat Christopher) #1

Hey guys,
After Monday's AWS EC2 failbasket I ended up with split-brain and a number of
rather strange configurations. I've cleaned those out with judicious use of
kill and by restarting the ES nodes.

My index has 5 shards and one replica on six data nodes. After my kill
fiesta all five shards were yellow or red. Four of them have come back to
green and are accepting writes again. The fifth shard has stubbornly
remained at yellow even after closing and opening the index. It claims to
have one active shard and one initializing shard. It's been initializing for
about 20 hours now and I don't think it's going to finish. When the other
shards were initializing there was a tremendous amount of disk activity; now
there is nothing spectacular going on.
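
One way to pin down which shard copy is stuck is to request cluster health at shard level (`GET /_cluster/health?level=shards`) and filter the result. The sketch below only parses such a response; the sample data is made up to mirror the situation described above, and the exact field layout is an assumption that may differ between ES versions.

```python
def stuck_shards(health):
    """Return (index, shard_id, initializing, unassigned) for every shard
    that still has initializing or unassigned copies."""
    found = []
    for index_name, index_info in health.get("indices", {}).items():
        for shard_id, shard in index_info.get("shards", {}).items():
            initializing = shard.get("initializing_shards", 0)
            unassigned = shard.get("unassigned_shards", 0)
            if initializing or unassigned:
                found.append((index_name, shard_id, initializing, unassigned))
    return sorted(found)


# Made-up sample shaped like a shard-level health response; not real output.
# "myindex" is a hypothetical index name.
sample = {
    "status": "yellow",
    "indices": {
        "myindex": {
            "status": "yellow",
            "shards": {
                "0": {"status": "green", "active_shards": 2,
                      "initializing_shards": 0, "unassigned_shards": 0},
                "4": {"status": "yellow", "active_shards": 1,
                      "initializing_shards": 1, "unassigned_shards": 0},
            },
        }
    },
}

print(stuck_shards(sample))
```

Here only shard 4 is reported, since it is the one with a copy still initializing.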

  1. How can I kick the last initializing shard to work? Will an increase in
    shards cause it to rebalance and fix itself, or will that only cause more
    problems?
  2. Two of the six data nodes have no data on them. I'm not entirely sure what
    they're doing but I'd like to get them involved. Any suggestions on how to
    push indices onto them? Possibly at the same time as fixing the busted
    shard?

Thanks,
Pat


(Pat Christopher) #2

I've increased the number of replicas from 1 to 2. This has caused the data
to be spread out over all nodes. I did some more research and found that
no, I can't change the number of shards for an index after it was created.
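
For readers following along, a replica change like this goes through the index update-settings endpoint (`PUT /{index}/_settings`). The sketch below only builds the request and does not send it; the host `localhost:9200` and the index name `myindex` are placeholders, not details from this thread.

```python
import json
import urllib.request


def build_replica_update(base_url, index, replicas):
    """Build (but do not send) a PUT request that sets number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host and index name; substitute your own.
req = build_replica_update("http://localhost:9200", "myindex", 2)
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
# To actually apply the change: urllib.request.urlopen(req)
```

The number of shards has no equivalent setting to update, which matches the finding above that it is fixed at index creation.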

However, one shard still has a replica which is stuck in initializing. Any
idea how I can get ES to abandon that replica and try again someplace else?
Or, if I turn the number of replicas down from 2 to 1, will ES kill the
replica that is still initializing and purge it?

Pat


(Shay Banon) #3

  2. Two of the six data nodes have no data on them. I'm not entirely sure what
    they're doing but I'd like to get them involved. Any suggestions on how to
    push indices onto them? Possibly at the same time as fixing the busted
    shard?

The reason is that rebalancing will not start until the cluster is green in
order to reduce the number of relocations.

Regarding the shard that is stuck initializing, that's strange... You
have several options here. The simplest would be to bring down the node the
replica shard is initializing on, and then start it back up. That should kick
the recovery off again. Another option is to reduce the replicas to 1, but it
won't necessarily remove that initializing shard.
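
After restarting the node, one way to check whether recovery finished is to ask the cluster health API to block until the cluster is green. This sketch only builds the URL; the host and the 60s timeout are placeholder assumptions. If the wait expires first, the response should report `timed_out` as true.

```python
from urllib.parse import urlencode


def health_wait_url(base_url, status="green", timeout="60s"):
    """Build a cluster health URL that waits for the given status."""
    query = urlencode({"wait_for_status": status, "timeout": timeout})
    return f"{base_url}/_cluster/health?{query}"


# Hypothetical host; substitute any node in the cluster.
url = health_wait_url("http://localhost:9200")
print(url)
```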

Btw, which version are you using?


(Pat Christopher) #4

Rebalancing only when green: makes sense. Thanks.

There was something wrong with that node as a whole: it had two permanently
initializing shards after the replica increase. I've shut it down and the
cluster has gone green. It had this message over and over in the log file:

[Beetle] master should not receive new cluster state from [[O'Meggan, Alfie]

The bad node is Beetle and the master for the cluster is O'Meggan, Alfie.
Any idea what would cause this?

I'm using 0.17.2.

Pat

