After struggling the whole day to recover as much data as possible, I
certainly know more about ES...
I was using a cluster of 7 nodes with 590 shards, each configured to
have one replica.
discovery.zen.minimum_master_nodes was set to 1 on all the nodes
(I have set it to 2 now). Also, discovery.zen.ping.timeout was 3
seconds, which is not enough if the master gets into a condition like
the one described below.
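For completeness, here is roughly what the relevant entries in my
elasticsearch.yml look like now (the 10s ping timeout is just an
example of a larger value, I am still experimenting with what is
appropriate):

# number of master-eligible nodes that must be visible before a node
# takes part in the cluster; was 1, now 2 (the usual advice is a
# quorum of the master-eligible nodes, i.e. n/2 + 1)
discovery.zen.minimum_master_nodes: 2

# how long to wait for ping responses during discovery; the 3s default
# proved too short once the master became unresponsive
discovery.zen.ping.timeout: 10s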
I'm still not sure what happened, but while trying to recover I
upgraded to 0.19.4. Things seemed better, but there was still
something odd: many shards had index directories with no data in them.
I would stop the node (which was a slave), remove the entire node
directory and restart the node.
In ES 0.19 the slave cannot know anything about the shard if I remove
the entire directory. Still, it would recreate the removed directory,
presumably because the master told it to. Then the master would spit
out hundreds of errors per second like this:
[2012-06-11 17:50:33,255][WARN ][cluster.action.shard ] [inuit]
received shard failed for [ng0010305][1], node[F_TMayYDRDeU0Kb2yOkaTA],
[P], s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[ng0010305][1] shard allocated for
local recovery (post api), should exists, but doesn't]]]
It seems to me that the master is trying to impose a shard
configuration onto a slave, because it somehow thinks that the slave
should contain that shard. This certainly makes sense if the master is
trying to replicate a shard. However, in this case there is no copy of
the shard anywhere, and the master becomes very unresponsive, perhaps
simply because it generates so many error messages like the one above.
So it could be that the original error was generated because the
cluster was stopped while a replication was in progress. Then, when
the cluster came up again, it was missing the node that contained the
good copy of the shard that was being replicated. ES therefore tried
to replicate the incomplete shard onto another node and ended up with
two incomplete copies of the shard. To add to the misery, it could
well be that some new documents were added to the index during this
time. Then, when the original node that had the good copy of the shard
came up, ES asked it to remove its data, since there were already two
nodes with what it believed were good copies. Could this be a possible
scenario?
In any case, I think ES should not try to do anything with a shard
that has no valid copy.
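One thing I am considering to reduce the chance of this happening
again (I have not verified that it would have helped in this
particular case, so take it as a sketch based on my reading of the
gateway settings) is to delay recovery until enough of the nodes have
rejoined, e.g.:

# do not start recovery until at least 5 of the 7 nodes are back,
# and even then wait a while in case the rest show up
gateway.recover_after_nodes: 5
gateway.recover_after_time: 5m
# if all 7 expected nodes are present, start recovery immediately
gateway.expected_nodes: 7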
Currently the cluster is in a somewhat stable state after shutting it
down and removing all the shards containing empty index directories.
roxana
On Jun 11, 5:19 pm, jagdeep reach.jagd...@gmail.com wrote:
What's there in the logs?
It must be saying dangling indexes, I guess. It must have happened
because of improper shard distribution across the different nodes.
Please post configuration details (entries in yml).
Regards
jagdeep
On Jun 11, 5:34 pm, anghelutar anghelu...@gmail.com wrote:
Hello,
I just had a problem (also with ES 0.18.6) very similar to what is
described here: http://elasticsearch-users.115913.n3.nabble.com/ES-Ate-My-Shards-Inde...
I have 590 shards and no fewer than 224 of them have gone missing. The
index directories appear on disk, but there is no data inside them.
It all seems to have been caused by a split-brain situation, the
causes of which I'm still analyzing.
Has there been any further investigation into what may have caused the
deletion of index data?
Thanks for any hints,
Roxana