That's strange… In 0.19, we have a better storage system for the local
gateway, where state is stored within each index/shard instead of globally
at the node level. I am still not sure what caused the data to be removed,
though… Elasticsearch does not remove data on its own unless instructed
to.
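For what it's worth, a quick way to see what is actually on disk for one of the failing shards (a sketch only, assuming the default path.data, node slot 0, and the 0.18.x local-gateway directory layout, with the index and shard number taken from the log further down) is:

    # on each data node, look for shard 1 of index co0198ca0694
    ls -la data/elasticsearch/nodes/0/indices/co0198ca0694/1/
    # a healthy copy should contain at least index/ and translog/ directories

If that directory is empty on every node, there is nothing left for the local gateway to recover.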
On Saturday, February 18, 2012 at 12:58 AM, Kenneth Loafman wrote:
No data was deleted from the filesystem. What I found when I looked was
an empty directory where the shard should have been.
On Fri, Feb 17, 2012 at 4:35 PM, Shay Banon kimchy@gmail.com wrote:
What I meant by data deleted is whether some data was deleted from the file
system by any chance. I suggest you start deleting the problematic indexes
that hold the problematic shards once the cluster is up.
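A minimal sketch of that, using the index name from the _status failure below and assuming the node is reachable on localhost:9200:

    # delete the whole problematic index (removes all of its shards cluster-wide)
    curl -XDELETE 'http://localhost:9200/co0181ca0711'

Only worth doing for indexes whose contents can be re-indexed or lost.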
On Friday, February 17, 2012 at 11:32 PM, Kenneth Loafman wrote:
Any ideas? I've shut down again, forced fsck on next boot, rebooted and
restarted. No real problems found, so we can rule that out. The logs are
too big to gist. What would you need from them if I could find it?
On Fri, Feb 17, 2012 at 1:55 PM, Kenneth Loafman kenneth@loafman.com wrote:
Hmm, something else is going on. Two of the indexes that were OK
originally are now showing IndexShardMissingException.
ES is indeed hungry!
On Fri, Feb 17, 2012 at 12:58 PM, Kenneth Loafman kenneth@loafman.com wrote:
Nothing was deleted manually or through curl. Is this recoverable at all?
What happened? Was this because of the split-cluster condition?
On Fri, Feb 17, 2012 at 12:54 PM, Shay Banon kimchy@gmail.com wrote:
This means that the shard was supposed to exist on that node but can't be
found; are you sure nothing was deleted?
On Friday, February 17, 2012 at 8:49 PM, Kenneth Loafman wrote:
Yes, a whole bunch of messages repeating like this:
[2012-02-17 18:48:00,586][WARN ][cluster.action.shard ] [Blindspot]
received shard failed for [co0198ca0694][1], node[o9nkBCsISKGt7P6acyshHQ],
[P], s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[co0198ca0694][1] shard allocated for
local recovery (post api), should exists, but doesn't]]]
On Fri, Feb 17, 2012 at 12:45 PM, Shay Banon kimchy@gmail.com wrote:
Do you see anything in the logs? It seems like there are 8 initializing
shards.
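If it helps, cluster health can be requested at shard granularity (standard cluster health API; localhost:9200 assumed), which shows, per index and per shard, how many copies are active, initializing, or unassigned:

    curl -XGET 'http://localhost:9200/_cluster/health?level=shards&pretty=true'

That should make it possible to tie the 8 initializing shards back to specific indexes.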
On Friday, February 17, 2012 at 8:34 PM, Kenneth Loafman wrote:
Hi,
We're on ES 0.18.6 driving a 4-node cluster on Rackspace.
Last night we had a network outage on two of the nodes and our 4-node
cluster morphed into two 2-node clusters; at least, I think that's what
happened. We shut all 4 nodes down cleanly and brought them up one at a
time, and the cluster reformed into one; however, it's stuck trying to get
out of red.
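As an aside, a common guard against this kind of split (an assumption on my part, since I don't know how discovery is configured here) is to require a quorum of master-eligible nodes before a cluster will form, e.g. for a 4-node cluster:

    # elasticsearch.yml on every node; quorum of 4 master-eligible nodes is 3
    discovery.zen.minimum_master_nodes: 3

That would not repair anything after the fact, but it keeps two halves from electing separate masters and diverging. The current cluster health looks like this: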
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 4,
"number_of_data_nodes" : 4,
"active_primary_shards" : 234,
"active_shards" : 426,
"relocating_shards" : 0,
"initializing_shards" : 8,
"unassigned_shards" : 126
}
It will stay in this mode for a long time and the '_status' will show a
bunch of
"failures" : [ {
"index" : "co0181ca0711",
"shard" : 1,
"reason" : "BroadcastShardOperationFailedException[[co0181ca0711][1]
]; nested: RemoteTransportException[[Whiteout][inet[/10.177.166.64:9300]][indices/status/shard]];
nested: IndexMissingException[[co0181ca0711] missing]; "
}, {
which seem to come and go but never get initialized. With 2 shards and 1
replica, it seems that ES should be able to recover the missing index from
the other shard or the replica, but it sticks at this point until I
manually delete what's left of the index. Was this due to the split-brain
issue, or is it just a limitation of ES? Is there a way to recover the
missing index from the replica? How do I find the replicas?
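For reference, the replica assignments should be visible in the cluster state routing table; a sketch of pulling it (assuming the node answers on localhost:9200, and using the index name from the failure above):

    # dump the routing table; each shard entry lists its primary and replica
    # copies, the node each one is assigned to, and its state
    # (STARTED, INITIALIZING, UNASSIGNED)
    curl -XGET 'http://localhost:9200/_cluster/state?pretty=true'

The relevant part should be routing_table.indices.co0181ca0711.shards in the response.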
...Thanks,
...Ken