I recently upgraded from 0.19.12 to 0.20.5. A node went down over the
weekend in my 30-node cluster. I use ES as a DB, so there were constant
writes while the node was down. I restarted the node and the cluster
went RED.
[2013-03-10 15:28:22,209][WARN ][indices.cluster ] [moloches-m13b] [stats][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [stats][0] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:122)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
What is the correct way to recover?

I have full replication on for this index, and it is a tiny index (8
documents, for a total store size of 10k). My desire is that ES should
just take care of this: I have 29 other copies, so why do I need to do
anything? In a dream world ES would delete the broken index and copy it
over from somewhere else. What am I missing?
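For anyone following along, a minimal sketch of the diagnostics involved, assuming the node's HTTP port is the default 9200 on localhost (the index name "stats" is taken from the log above):

    # overall cluster status, plus per-shard detail to see which copies are unassigned
    curl -s 'localhost:9200/_cluster/health?level=shards&pretty=true'

    # shard-level status of the affected index
    curl -s 'localhost:9200/stats/_status?pretty=true'

    # routing table and index metadata as the master sees them
    curl -s 'localhost:9200/_cluster/state?pretty=true'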
On Mon, 2013-03-11 at 06:07 -0700, Andy Wick wrote:
> I recently upgraded from 0.19.12 to 0.20.5. A node went down over the
> weekend in my 30-node cluster. I use ES as a DB, so there were constant
> writes while the node was down. I restarted the node and the cluster
> went RED.
>
> [2013-03-10 15:28:22,209][WARN ][indices.cluster ] [moloches-m13b] [stats][0] failed to start shard
> org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [stats][0] shard allocated for local recovery (post api), should exists, but doesn't
Hmm, I wonder if there is a problem with the data store on that node.
Perhaps just delete the data store for that index on that node, and
restart.
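Something along these lines, as a rough sketch only. The path below is the default on-disk layout (<path.data>/<cluster_name>/nodes/0/indices/<index>/), so adjust the data path and cluster name for your install, and only do it with the node stopped:

    # on the affected node, with the elasticsearch process stopped,
    # remove just the broken index's data directory
    rm -rf /path/to/data/<cluster_name>/nodes/0/indices/stats

    # start the node again, then watch the index recover from a replica elsewhere
    curl -s 'localhost:9200/_cluster/health/stats?wait_for_status=green&timeout=1m&pretty=true'

With replicas on the other 29 nodes, the shard should be pulled over again once the node rejoins.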
I stopped the node, deleted the directory, and restarted it, and then
immediately two other nodes started having the same issue with the same
index. Eventually I just gave up and deleted the index (I should say I
deleted all the directories, because -XDELETE on the index would just
hang, I guess because the cluster was RED). Should this work, or am I
missing the point of replication? To me it seems like if one node is
bad, ES should just clean up and copy over a good version automatically.
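For the record, the API-level version of the clean-up looks roughly like this; it was the DELETE below that hung for me while the cluster was RED. The settings on the recreate are illustrative only (the shard count is a guess; 29 replicas is what "full replication" means on a 30-node cluster):

    # delete the broken index -- this just hung for me
    curl -XDELETE 'localhost:9200/stats'

    # recreate it afterwards
    curl -XPUT 'localhost:9200/stats' -d '{
      "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 29
      }
    }'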
On Mon, 2013-03-11 at 08:30 -0700, Andy Wick wrote:
> I stopped the node, deleted the directory, and restarted it, and then
> immediately two other nodes started having the same issue with the same
> index. Eventually I just gave up and deleted the index (I should say I
> deleted all the directories, because -XDELETE on the index would just
> hang, I guess because the cluster was RED). Should this work, or am I
> missing the point of replication? To me it seems like if one node is
> bad, ES should just clean up and copy over a good version automatically.
Yes, it should work.

Please can you open an issue with a full description, plus the full
logs?
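Roughly what would be most useful to attach, assuming the default log location (logs/<cluster_name>.log under the Elasticsearch home directory; adjust for your install):

    # pull the recovery failures out of each affected node's log
    grep -B 2 -A 20 'failed to start shard' logs/*.log > failed_shard_recovery.log

    # plus the output of the cluster health / cluster state calls shown earlier in the thread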