0.19.10 - cluster wedged, most operations failing

Cluster went red earlier today, possibly due to a node with a failing disk
drive. (RAID, writes would've been slow but hopefully not corrupted.)

Shutting down the suspect node didn't help so we restarted the whole
cluster, less the suspect node. Came up red, some shards stayed in
"initializing", some stayed unassigned. Got the suspect node's hardware
OK, restarted the cluster with that node... still no change. We're stuck
here:

curl 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "production",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 20,
  "active_primary_shards" : 186,
  "active_shards" : 361,
  "relocating_shards" : 0,
  "initializing_shards" : 9,
  "unassigned_shards" : 22
}

We can write new data to one of our indices, but most cluster maintenance
commands fail... we can't delete aliases, close indices, or create new
indices -- curl just hangs, with no errors or anything.
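
For example, calls along these lines all just sit there (the alias name and the new index name here are placeholders, not our real ones):

curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    { "remove" : { "index" : "2012102902", "alias" : "some_alias" } }
  ]
}'
curl -XPOST 'http://localhost:9200/2012102902/_close'
curl -XPUT 'http://localhost:9200/new_index_name'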

Logs have the typical "startup stuff" in them; I'm accustomed to seeing
these messages during cluster startup but normally they're temporary:

[22:06:00,555][WARN ][indices.cluster ] [elastic-004] [2012102902][14] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [2012102902][14] shard allocated for local recovery (post api), should exists, but doesn't
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:122)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)
[22:06:00,566][WARN ][cluster.action.shard ] [elastic-004] sending failed shard for [2012102902][14], node[PI-PwP7GRBal2k_MseBUVA], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[2012102902][14] shard allocated for local recovery (post api), should exists, but doesn't]]]

This is a production cluster, but if there's no way to quickly recover the
indices which are red, we could get by without them... I'm just not sure
how to safely remove them.

Help/suggestions? Thanks!

        -Robert.-

--

Hello Robert,

Please share the config settings for the following:
gateway.recover_after_nodes
gateway.recover_after_time
gateway.expected_nodes

I had a similar situation with 0.19.7, with default settings for the
above, on a 4-node cluster. Starting with 0.19.8, dangling indices are
imported automatically by default.

I didn't get the chance to play with this new feature, but I'd suspect
one might still get into trouble with the default settings on a
cluster with more than 2 nodes. It might help to set those according
to your cluster (if you haven't already, of course :p) and restart the
whole cluster again.
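
For a cluster your size, a rough sketch of what I mean for elasticsearch.yml (the exact numbers are just my guess - tune them for your setup):

gateway.recover_after_nodes: 18
gateway.recover_after_time: 5m
gateway.expected_nodes: 20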

If you still have issues, search for the missing shards in the data
directories of your nodes (for example, nodes/0/indices/2012102902/14/
from your example above). If you can't find them, or they're empty,
then I guess your data is gone for some reason - and you'd have to
reindex or restore from a backup. To remove the index safely you would
do something like:

curl -XDELETE localhost:9200/index_name_goes_here
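
And to go hunting for those shard directories in the first place, something like this on every data node (the path is a guess based on a typical install - adjust it to your path.data and cluster name):

# path assumes the default data layout and cluster name "production"
ls -la /var/lib/elasticsearch/production/nodes/0/indices/2012102902/14/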

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


--

On Monday, November 5, 2012 12:36:33 AM UTC-7, Radu Gheorghe wrote:

Hello Robert,

Please share the config settings for the following:
gateway.recover_after_nodes
gateway.recover_after_time
gateway.expected_nodes

Thanks Radu! We ended up "deleting" the indices by shutting down the
cluster, moving the directories corresponding to those indices, and then
bringing the cluster back up, so this is no longer urgent. However it is
concerning since there was pretty substantial data loss.
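
Roughly what we did, with paths shown here only as an illustration rather than our exact commands:

# with the whole cluster shut down, on each data node:
mv /var/lib/elasticsearch/production/nodes/0/indices/2012102902 /root/es-postmortem/
# ...repeat for the other red indices, then start all the nodes back up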

We have:
recover_after_nodes: 18
recover_after_time: 5m

gateway.expected_nodes is not defined. When this started we had 19 nodes
in the cluster (now have 20; haven't changed the settings though). All
nodes were started at the same time and all joined the cluster before it
began recovery.

My reading of the docs is that without expected_nodes defined, it's going
to wait 5 minutes before beginning recovery; if that's incorrect or the
above settings are a bad idea please let me know.

Thanks for mentioning the new dangling indices settings; we noticed that
last night and are thinking we might set it to "closed" on our cluster as
that seems possibly safer than the default.
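
If we do go that route, I'm assuming it's just one line in elasticsearch.yml (setting name taken from the 0.19.8 dangling-indices change - please correct me if I've misread it):

gateway.local.auto_import_dangled: closed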

If you still have issues, search for the missing shards in the data
directories of your nodes (for example, nodes/0/indices/2012102902/14/
from your example above). If you can't find them, or they're empty,
then I guess your data is gone for some reason - and you'd have to
reindex or restore from a backup. To remove the index safely you would
do something like:

Shards were not present in the directories. E.g. index 2012102902, shard
10 was supposedly initializing on node elastic-016 -- here's the _status
snippet:

        "state" : "INITIALIZING",
        "primary" : true,
        "node" : "Wwprc9egTJKR-qBskhRyRQ",
        "relocating_node" : null,
        "shard" : 10,
        "index" : "2012102902"

but here's what the directory looked like:

ls -la 2012102902/10/
-rw-r--r-- 1 root root 39 Nov 4 14:49 2012102902/10/_state/state-8

cat 2012102902/10/_state/state-8
{
  "version" : 8,
  "primary" : true
}

Would love to know how to track down how that shard disappeared. Just had
a look on all the other nodes and none of them have a 2012102902/10
directory at all.

curl -XDELETE localhost:9200/index_name_goes_here

Didn't try that, but note that all other index API calls, including creating
new indices and changing aliases on an index (even one which was green),
failed, so I was kinda expecting that to fail too. Plus, by keeping the old
data around we hoped there might be some postmortem we could do.

       -Robert.-

--

Hello Robert,

On Mon, Nov 5, 2012 at 5:18 PM, Robert Tarrall tarrall@gmail.com wrote:

On Monday, November 5, 2012 12:36:33 AM UTC-7, Radu Gheorghe wrote:

Hello Robert,

Please share the config settings for the following:
gateway.recover_after_nodes
gateway.recover_after_time
gateway.expected_nodes

Thanks Radu! We ended up "deleting" the indices by shutting down the
cluster, moving the directories corresponding to those indices, and then
bringing the cluster back up, so this is no longer urgent. However it is
concerning since there was pretty substantial data loss.

We have:
recover_after_nodes: 18
recover_after_time: 5m

gateway.expected_nodes is not defined. When this started we had 19 nodes in
the cluster (now have 20; haven't changed the settings though). All nodes
were started at the same time and all joined the cluster before it began
recovery.

My reading of the docs is that without expected_nodes defined, it's going to
wait 5 minutes before beginning recovery; if that's incorrect or the above
settings are a bad idea please let me know.

My understanding is the same.

Thanks for mentioning the new dangling indices settings; we noticed that
last night and are thinking we might set it to "closed" on our cluster as
that seems possibly safer than the default.

If you still have issues, search for the missing shards in the data
directories of your nodes (for example, nodes/0/indices/2012102902/14/
from your example above). If you can't find them, or they're empty,
then I guess your data is gone for some reason - and you'd have to
reindex or restore from a backup. To remove the index safely you would
do something like:

Shards were not present in the directories. E.g. index 2012102902, shard 10
was supposedly initializing on node elastic-016 -- here's the _status
snippet:

        "state" : "INITIALIZING",
        "primary" : true,
        "node" : "Wwprc9egTJKR-qBskhRyRQ",
        "relocating_node" : null,
        "shard" : 10,
        "index" : "2012102902"

but here's what the directory looked like:

ls -la 2012102902/10/
-rw-r--r-- 1 root root 39 Nov 4 14:49 2012102902/10/_state/state-8

cat 2012102902/10/_state/state-8
{
  "version" : 8,
  "primary" : true
}

Would love to know how to track down how that shard disappeared. Just had a
look on all the other nodes and none of them have a 2012102902/10 directory
at all.

I'm not sure why the shards disappeared in your case. Any clues in the
logs from the day you had the failure?

curl -XDELETE localhost:9200/index_name_goes_here

Didn't try that, but note that all other index API calls, including creating
new indices and changing aliases on an index (even one which was green),
failed, so I was kinda expecting that to fail too. Plus, by keeping the old
data around we hoped there might be some postmortem we could do.

I see. Well, you can always copy the data directory to make sure you
don't completely lose that data. But you'd have to do it when ES is
stopped on that node - otherwise, disable flush while copying:

curl -XPUT localhost:9200/index_name/_settings -d '{
  "index": {
    "translog.disable_flush": true
  }
}'

And of course enable it back again when you're done :)
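
That's just the same call with the flag flipped back (index name is a placeholder, as above):

curl -XPUT localhost:9200/index_name/_settings -d '{
  "index": {
    "translog.disable_flush": false
  }
}'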

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--