Our cluster went red earlier today, possibly because of a node with a
failing disk drive. (It's on RAID, so writes would have been slow but
hopefully not corrupted.) Shutting down the suspect node didn't help, so
we restarted the whole cluster without it. It came up red: some shards
stayed in "initializing" and some stayed unassigned. We got the suspect
node's hardware back in working order and restarted the cluster with that
node included... still no change. We're stuck here:
curl 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "production",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 20,
  "active_primary_shards" : 186,
  "active_shards" : 361,
  "relocating_shards" : 0,
  "initializing_shards" : 9,
  "unassigned_shards" : 22
}
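For anyone reading along, here is a rough Python sketch of how I'm adding
up those numbers. It only parses the health response pasted above; the
function name `summarize_health` is mine, not part of Elasticsearch.

```python
import json

# Health response copied from the curl output above.
HEALTH_JSON = """
{
  "cluster_name" : "production",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 20,
  "active_primary_shards" : 186,
  "active_shards" : 361,
  "relocating_shards" : 0,
  "initializing_shards" : 9,
  "unassigned_shards" : 22
}
"""


def summarize_health(health):
    """Return (status, shards_not_active, active_replicas) for a health dict."""
    # Shards that are neither started nor relocating: still initializing or unassigned.
    not_active = health["initializing_shards"] + health["unassigned_shards"]
    # active_shards counts primaries and replicas together.
    active_replicas = health["active_shards"] - health["active_primary_shards"]
    return health["status"], not_active, active_replicas


status, not_active, active_replicas = summarize_health(json.loads(HEALTH_JSON))
print(f"status={status} not_active={not_active} active_replicas={active_replicas}")
```

So 31 shards (9 initializing plus 22 unassigned) never become active.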
We can write new data to one of our indices, but most cluster maintenance
commands fail: we can't delete aliases, close indices, or open new
indices. curl just hangs, with no errors or anything.
The logs have the typical "startup stuff" in them. I'm accustomed to
seeing these messages during cluster startup, but normally they're
temporary:
[22:06:00,555][WARN ][indices.cluster ] [elastic-004] [2012102902][14] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [2012102902][14] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:122)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
[22:06:00,566][WARN ][cluster.action.shard ] [elastic-004] sending failed shard for [2012102902][14], node[PI-PwP7GRBal2k_MseBUVA], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[2012102902][14] shard allocated for local recovery (post api), should exists, but doesn't]]]
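To see which index/shard pairs keep failing, something like the following
Python sketch could pull them out of the log text. It assumes the warnings
always contain the "[index][shard] failed to start shard" fragment shown
above; the regex and the name `failed_shards` are my own, so treat this as
a starting point rather than a tested tool.

```python
import re

# Matches the "[index][shard] failed to start shard" fragment from the warnings.
FAILED_SHARD = re.compile(r"\[([^\]\[]+)\]\[(\d+)\] failed to start shard")


def failed_shards(log_text):
    """Return a sorted list of unique (index, shard) pairs that failed to start."""
    # Collapse newlines and repeated spaces so wrapped log lines still match.
    normalized = " ".join(log_text.split())
    pairs = {(index, int(shard)) for index, shard in FAILED_SHARD.findall(normalized)}
    return sorted(pairs)


# Excerpt of the first warning above, wrapped the way my mail client wrapped it.
SAMPLE_LOG = """[22:06:00,555][WARN ][indices.cluster ] [elastic-004]
[2012102902][14] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
[2012102902][14] shard allocated for local recovery (post api), should
exists, but doesn't
"""

print(failed_shards(SAMPLE_LOG))
```

Running it over the full log files from each node should give the complete
list of shards that never leave "initializing".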
This is a production cluster, but if there's no way to quickly recover the
red indices we could get by without them. I'm just not sure how to remove
them safely.
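To work out exactly which indices are red, my plan is to ask for health
with `level=indices` (assuming that parameter behaves on our version the
way the cluster health docs describe) and filter the response. A minimal
sketch of the filtering step follows; the sample response is trimmed and
partly made up (only `2012102902` comes from our logs, the other index
name is hypothetical), and `red_indices` is my own helper name.

```python
def red_indices(health):
    """Return the sorted names of indices whose status is "red".

    `health` is the parsed JSON from _cluster/health?level=indices,
    which is assumed to carry a per-index "indices" section.
    """
    indices = health.get("indices", {})
    return sorted(name for name, info in indices.items() if info.get("status") == "red")


# Trimmed, illustrative response; not real output from our cluster.
SAMPLE_RESPONSE = {
    "cluster_name": "production",
    "status": "red",
    "indices": {
        "2012102902": {"status": "red"},
        "example-green-index": {"status": "green"},  # hypothetical index name
    },
}

print(red_indices(SAMPLE_RESPONSE))
```

That only identifies the candidates; it doesn't answer whether deleting
them is safe while the cluster is in this state.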
Help/suggestions? Thanks!
-Robert.-