SOLVED: Unassigned shards after restart, allocated for local recovery, should exist but doesn't + no segments file found in store


#1

I think my cluster is in an unrecoverable state after a recent restart.

It is possible that the master did not shut down cleanly. Now, at cluster start, more than half of the shards are UNASSIGNED and their indices are RED. I can use the cluster, but more than half of my data is inaccessible.

I want to understand: (1) what state is my cluster in? (2) Can I nudge it back into recovery -- if so, how; if not, why not? (3) How can I prevent this in the future?
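For question (1), a per-shard listing is often easier to read than the logs. The sketch below counts UNASSIGNED shards per index from the plain-text output of the `_cat/shards` endpoint. The function names and the `http://localhost:9200` base URL are assumptions for illustration and are not taken from the cluster described here.

```python
import urllib.request
from collections import Counter


def fetch_cat_shards(base_url="http://localhost:9200"):
    """Fetch the plain-text shard listing (assumes HTTP is reachable at base_url)."""
    with urllib.request.urlopen(base_url + "/_cat/shards") as resp:
        return resp.read().decode("utf-8")


def count_unassigned(cat_shards_text):
    """Count UNASSIGNED shards per index in `_cat/shards` text output."""
    counts = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        # Leading columns are: index, shard, prirep, state
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            counts[fields[0]] += 1
    return counts
```

Calling `count_unassigned(fetch_cat_shards())` on a node that answers HTTP gives a quick per-index tally, which helps decide which RED indices matter before attempting anything destructive.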

I have a two-node cluster: a master and a data node. I've been running a number of scenarios on the cluster. I have many different indices with many different shard counts. Most of the indices do not have replicas, and I can afford to lose the ones that do.

I believe the cluster was idle when I shut it down.

I was initially running 1.6.1, but upgraded to 1.7.3 to see if it helped with recovery. Upgrading to 2.x is not an option for me currently.

I've gone through many different variants of "when shards are UNASSIGNED, do this", including changing allocation settings and trying to force explicit routing. I also removed the 'segments.gen' files from the affected shard directories. None of these steps has cleared the UNASSIGNED state.
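For reference, forcing explicit routing on 1.x goes through the cluster reroute API with an `allocate` command. The sketch below only builds the JSON body for `POST /_cluster/reroute`. The helper name is hypothetical, and the node value passed in is a placeholder for whichever node should hold the shard. Note that `allow_primary` set to true can result in data loss, so it is a last resort.

```python
import json


def build_allocate_command(index, shard, node, allow_primary=False):
    """Build the body for POST /_cluster/reroute with one `allocate` command (1.x syntax)."""
    body = {
        "commands": [
            {
                "allocate": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    # True permits allocating a primary, which may lose data
                    "allow_primary": allow_primary,
                }
            }
        ]
    }
    return json.dumps(body)
```

For example, `build_allocate_command("newdata-test-2015.10.14", 5, "elasticsearch1")` produces a body for shard 5 of that index, with the node name used here purely as an illustration.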

The volume of logging (>8000 lines) makes it hard to identify exactly what is wrong, and what I should do. If I can't resolve this quickly, I'll have to delete the RED indices with UNASSIGNED shards.

Can I recover from this state with a minimal loss of data?

The full log file is too large to post, so I'll include the useful highlights.

Nov 4 10:07:15 localhost [WARN ][indices.cluster ] [elasticsearch1] [[newdata-test-2015.10.14][5]] marking and sending shard failed due to [failed recovery]
Nov 4 10:07:15 localhost org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [newdata-test-2015.10.14][5] failed to fetch index version after copying it over
Nov 4 10:07:15 localhost at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:161)
Nov 4 10:07:15 localhost at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
Nov 4 10:07:15 localhost at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Nov 4 10:07:15 localhost at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Nov 4 10:07:15 localhost at java.lang.Thread.run(Thread.java:745)
Nov 4 10:07:15 Caused by: org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [newdata-test-2015.10.14][5] shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock]
Nov 4 10:07:15 localhost at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:135)
Nov 4 10:07:15 localhost ... 4 more
.... (many more similar) ....

Occasionally I'll see things like:

Nov 4 10:07:45 localhost [WARN ][cluster.action.shard ] [elasticsearch1] [testindex-2015.07.31][0] received shard failed for [testindex-2015.07.31][0], node[rmSQDHKGQ6-VI6sEj4ERNA], [P], s[INITIALIZING], unassigned_info[[reason=CLUSTER_RECOVERED], at[2015-11-04T15:07:14.255Z]], indexUUID [B1aI44bgQWCsNDHJEPknDw], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[testindex-2015.07.31][0] failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[testindex-2015.07.31][0] shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock, _90.cfs, _43.cfe, _90.cfe, _5b.cfs, _2q.fdx, _74_Lucene41_0.doc, _h67.si, _5b.c...
...
Nov 4 10:07:45 localhost [WARN ][indices.cluster ] [elasticsearch1] [[testindex-2015.08.01][0]] marking and sending shard failed due to [failed recovery]

The last line in the log from this start is:

Nov 4 10:07:48 localhost [WARN ][cluster.action.shard ] [elasticsearch1] [kibana-int][2] received shard failed for [kibana-int][2], node[rmSQDHKGQ6-VI6sEj4ERNA], [P], s[INITIALIZING], unassigned_info[[reason=CLUSTER_RECOVERED], at[2015-11-04T15:07:14.275Z]], indexUUID [w-i8HQtCSZ2e7wcHcMy5gw], reason [master [elasticsearch1][rmSQDHKGQ6-VI6sEj4ERNA][esmaster1][inet[/192.168.1.42:9300]]{master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]
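With thousands of lines like these, condensing the log to one entry per shard makes the pattern easier to see. The following is a rough sketch in Python that groups the `received shard failed` lines by index and shard and attaches a short label based on the wording in the excerpts above. The label names and the function name are my own, and the patterns assume each log record sits on a single line.

```python
import re
from collections import defaultdict

# Matches e.g. "received shard failed for [testindex-2015.07.31][0]"
FAILED_RE = re.compile(r"received shard failed for \[([^\]]+)\]\[(\d+)\]")
# Captures the file list after "current files: [" (tolerates truncated lines)
FILES_RE = re.compile(r"should exist, but doesn't, current files: \[([^\]]*)")


def summarize(lines):
    """Map (index, shard) -> set of short labels describing the logged failure."""
    summary = defaultdict(set)
    for line in lines:
        failed = FAILED_RE.search(line)
        if not failed:
            continue
        key = (failed.group(1), int(failed.group(2)))
        files_match = FILES_RE.search(line)
        if files_match:
            files = [f.strip() for f in files_match.group(1).split(",") if f.strip()]
            if files == ["write.lock"]:
                label = "only write.lock present"
            else:
                label = "other files present"
        elif "resend shard failure" in line:
            label = "resend shard failure"
        else:
            label = "other"
        summary[key].add(label)
    return summary
```

Feeding it an open log file, for example `summarize(open(path))` with `path` pointing at the Elasticsearch log, yields a dictionary that can be sorted and printed to see which shards fail and how.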

Many thanks for any help that can be offered.


#2

In my case, I think the problem was that the nodes in the cluster were unable to communicate. When I restarted the box, the firewall reverted to its default, secure state, and that prevented the nodes from communicating fully. I only discovered this after I tried to delete the indices and that failed as well. Unfortunately, I lost most of my indices as a result. It seems like the error messages could be clearer in this circumstance.
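A quick connectivity check after a reboot might have caught this earlier. Here is a minimal sketch in Python that tests whether a TCP connection can be opened to each node. Port 9300 is the transport port shown in the log above and 9200 is the default HTTP port; the function names and the host list passed in are assumptions for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


def check_nodes(hosts, ports=(9300, 9200)):
    """Check every host/port pair and return a dict of (host, port) -> bool."""
    return {(host, port): can_connect(host, port) for host in hosts for port in ports}
```

A result only describes the direction it was run in, so it is worth running something like `check_nodes(["192.168.1.42"])` from each node against the other, since a firewall can block one direction while allowing the other.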

