ES Ate My Shards/Indexes

Hi,

We're on ES 0.18.6, driving a 4-node cluster on RackSpace.

Last night we had a network outage on two of the nodes, and our 4-node
cluster morphed into two 2-node clusters. I think that's what happened,
anyway. We shut all 4 nodes down cleanly and brought them up one at a
time, and the cluster re-formed into one; however, it's stuck and can't
get out of red.
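
The health output below is from the cluster health API (this assumes the
default HTTP port on one of the nodes), i.e. something like:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'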

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 234,
  "active_shards" : 426,
  "relocating_shards" : 0,
  "initializing_shards" : 8,
  "unassigned_shards" : 126
}

It stays in this state for a long time, and '_status' shows a bunch of
entries like:

"failures" : [ {
  "index" : "co0181ca0711",
  "shard" : 1,
  "reason" : "BroadcastShardOperationFailedException[[co0181ca0711][1]

]; nested: RemoteTransportException[[Whiteout][inet[/10.177.166.64:9300]][indices/status/shard]];
nested: IndexMissingException[[co0181ca0711] missing]; "
}, {
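
(That output is from the index status API; with default settings something
like curl -XGET 'http://localhost:9200/_status?pretty=true' should
reproduce it.)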

These failures seem to come and go, but the shards never get initialized.
With 2 shards and 1 replica, it seems ES should be able to recover the
missing index from the replica copy, but it sticks at this point until I
manually delete what's left of the index. Was this due to the split-brain
issue, or is this just a limitation of ES? Is there a way to recover the
missing index from the replica? And how do I find the replicas?
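
I'm guessing the place to look for that is the cluster state API, e.g.:

curl -XGET 'http://localhost:9200/_cluster/state?pretty=true'

and then the routing_table / routing_nodes sections, which as far as I can
tell list each shard copy, whether it is the primary, its state, and which
node it is assigned to, but I'm not sure that's the right way.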

...Thanks,
...Ken

On Fri, Feb 17, 2012 at 12:45 PM, Shay Banon kimchy@gmail.com wrote:

Do you see anything in the logs? It seems like there are 8 initializing shards.

On Friday, February 17, 2012 at 8:49 PM, Kenneth Loafman wrote:

Yes, a whole bunch of messages repeating like this:

[2012-02-17 18:48:00,586][WARN ][cluster.action.shard ] [Blindspot] received shard failed for [co0198ca0694][1], node[o9nkBCsISKGt7P6acyshHQ], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[co0198ca0694][1] shard allocated for local recovery (post api), should exists, but doesn't]]]

On Fri, Feb 17, 2012 at 12:54 PM, Shay Banon kimchy@gmail.com wrote:

This means that the shard was supposed to exist on that node, but it can't be found. Are you sure nothing was deleted?

On Fri, Feb 17, 2012 at 12:58 PM, Kenneth Loafman kenneth@loafman.com wrote:

Nothing was deleted manually or through curl. Is this recoverable at all?
What happened? Was this because of the split-cluster condition?

On Fri, Feb 17, 2012 at 1:55 PM, Kenneth Loafman kenneth@loafman.com wrote:

Hmm, something else is going on. Two of the indexes that were OK
originally are now showing IndexShardMissingException.

ES is indeed hungry!

On Friday, February 17, 2012 at 11:32 PM, Kenneth Loafman wrote:

Any ideas? I've shut down again, forced an fsck on the next boot, then
rebooted and restarted ES. No real problems were found, so we can rule out
filesystem corruption. The logs are too big to gist. What would you need
from them, if I can find it?
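
If it helps, I could pull out just the shard-failure lines with something
like this (the log path is a guess at where ours live):

grep -e 'received shard failed' -e 'IndexShardGatewayRecoveryException' /var/log/elasticsearch/*.log > shard-failures.log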

On Fri, Feb 17, 2012 at 4:35 PM, Shay Banon kimchy@gmail.com wrote:

What I meant by data deleted is: was some data deleted from the file system, by any chance? I suggest you start by deleting the problematic indexes that hold the problematic shards while the cluster is up.
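
For example, for the index from your status output, something along these
lines (adjust the host to one of your nodes):

curl -XDELETE 'http://localhost:9200/co0181ca0711'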

On Saturday, February 18, 2012 at 12:58 AM, Kenneth Loafman wrote:

No data was deleted from the filesystem. What I found when I looked was an
empty directory where the shard should have been.
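
For reference, I believe with the default local gateway the shard data
should sit under something like (path.data and cluster.name stand in for
our settings):

ls <path.data>/<cluster.name>/nodes/0/indices/co0198ca0694/1/

with index/ and translog/ directories inside; that is where I found the
empty directory.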

On Mon, Feb 20, 2012 at 6:56 AM, Shay Banon kimchy@gmail.com wrote:

That's strange… In 0.19 we have a better storage system for the local gateway, where state is stored within each index/shard instead of globally at the node level. I am still not sure what caused the data to be removed, though; elasticsearch does not remove data on its own unless instructed to.

On Mon, Feb 20, 2012 at 9:34 AM, Kenneth Loafman kenneth@loafman.com wrote:

But data will be removed if a shard relocates, right? So what happens in a
split-brain situation when the brain is put back together?
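
I assume the way to keep the split from happening again is
discovery.zen.minimum_master_nodes in elasticsearch.yml, set to a majority
of the cluster, e.g. for our 4 nodes:

discovery.zen.minimum_master_nodes: 3

so that a minority partition can't elect its own master.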

On Tuesday, February 21, 2012 at 5:30 PM, Kenneth Loafman wrote:

I'd like to understand what could have caused this loss of shards. What
data do you need in order to track it down?

Maybe we can start with logs from the time that you first restarted the cluster?
