SOLVED: Unassigned shards after restart, allocated for local recovery, should exist but doesn't + no segments file found in store

unknownunknown · November 4, 2015, 5:51pm

I think my cluster is in an unrecoverable state after a recent restart.

It is possible that the master did not shut down cleanly. Now, at cluster start, more than half of the shards are UNASSIGNED and their indices are RED. I can use the cluster, but more than half of my data is inaccessible.

I want to understand: (1) what state is my cluster in? (2) Can I nudge it back into recovery -- if so, how; if not, why not? (3) How can I prevent this in the future?

I have a two-node cluster: a master and a data node. I've been running a number of scenarios on the cluster, so I have many different indices with many different shard counts. Most of the indices do not have replicas, and I can afford to lose the ones that do.

I believe the cluster was idle when I shut it down.

I was initially running 1.6.1, but upgraded to 1.7.3 to see if it helped with recovery. Upgrading to 2.x is not an option for me currently.

I've gone through many different variants of "when shards are UNASSIGNED, do this", including changing the allocation settings and trying to force explicit routing. I also removed the 'segments.gen' files from the shard directories. None of these steps has cleared the UNASSIGNED state.
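For readers wondering what "force explicit routing" looks like on 1.x, here is a minimal sketch and not necessarily the exact commands used above. It parses the plain-text output of `GET _cat/shards` and builds a request body for `POST _cluster/reroute` using the 1.x `allocate` command. The node name `data-node-1` is hypothetical, and the helper names are made up for illustration. Note that `allow_primary: true` can allocate an empty primary and discard whatever data the shard had, so the sketch leaves it off by default.

```python
import json


def unassigned_shards(cat_shards_text):
    """Return (index, shard, prirep) for each UNASSIGNED row of `GET _cat/shards` output."""
    rows = []
    for line in cat_shards_text.splitlines():
        cols = line.split()
        # Default column order: index shard prirep state [docs store ip node]
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            rows.append((cols[0], int(cols[1]), cols[2]))
    return rows


def reroute_body(shards, node, allow_primary=False):
    """Build the JSON body for `POST _cluster/reroute` (Elasticsearch 1.x `allocate` command)."""
    commands = [
        {
            "allocate": {
                "index": index,
                "shard": shard,
                "node": node,  # hypothetical node name, replace with a real one
                "allow_primary": allow_primary,  # True risks data loss
            }
        }
        for index, shard, _prirep in shards
    ]
    return json.dumps({"commands": commands}, indent=2)


if __name__ == "__main__":
    sample = (
        "newdata-test-2015.10.14 5 p UNASSIGNED\n"
        "kibana-int 2 p STARTED 10 5kb 192.168.1.42 elasticsearch1\n"
    )
    print(reroute_body(unassigned_shards(sample), "data-node-1"))
```

The printed body can then be sent to the cluster with any HTTP client. If the nodes cannot reach each other, a reroute like this will not help, which is worth ruling out first.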

The volume of logging (>8000 lines) makes it hard to identify exactly what is wrong, and what I should do. If I can't resolve this quickly, I'll have to delete the RED indices with UNASSIGNED shards.
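One way to cut through a log of this size is to group the failure lines by reason and by shard. The following is a rough sketch under assumptions: the reason strings are taken from the excerpts in this post, the `[index][shard]` pattern is inferred from those same lines, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches "[index-name][shard-number]" as it appears in the log excerpts.
SHARD_RE = re.compile(r"\[([^\[\]\s]+)\]\[(\d+)\]")

# Substrings seen in this thread's logs; the most specific cause is listed first
# so that a line containing several of them is counted under the root cause.
KNOWN_REASONS = (
    "should exist, but doesn't",
    "failed to fetch index version",
    "resend shard failure",
)


def summarize(lines):
    """Count log lines per known failure reason and collect the affected shards."""
    counts = Counter()
    shards = {}
    for line in lines:
        reason = next((r for r in KNOWN_REASONS if r in line), None)
        if reason is None:
            continue
        counts[reason] += 1
        match = SHARD_RE.search(line)
        if match:
            shards.setdefault(reason, set()).add((match.group(1), int(match.group(2))))
    return counts, shards


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        counts, shards = summarize(handle)
    for reason, count in counts.most_common():
        print(f"{count:6d} lines, {len(shards.get(reason, ()))} shards: {reason}")
```

Run it with the path to the log file as the only argument. A summary like this shows whether thousands of lines boil down to one repeated failure across many shards or to several distinct problems.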

Can I recover from this state with a minimal loss of data?

The full log file is too big to post, so I'll include the useful highlights.


Nov  4 10:07:15 localhost [WARN ][indices.cluster          ] [elasticsearch1] [[newdata-test-2015.10.14][5]] marking and sending shard failed due to [failed recovery]
Nov  4 10:07:15 localhost org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [newdata-test-2015.10.14][5] failed to fetch index version after copying it over
Nov  4 10:07:15 localhost     at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:161)
Nov  4 10:07:15 localhost     at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
Nov  4 10:07:15 localhost     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Nov  4 10:07:15 localhost     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Nov  4 10:07:15 localhost     at java.lang.Thread.run(Thread.java:745)
Nov  4 10:07:15 Caused by: org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [newdata-test-2015.10.14][5] shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock]
Nov  4 10:07:15 localhost     at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:135)
Nov  4 10:07:15 localhost     ... 4 more

.... (many more similar) ....

Occasionally I'll see things like:
Nov 4 10:07:45 localhost [WARN ][cluster.action.shard ] [elasticsearch1] [testindex-2015.07.31][0] received shard failed for [testindex-2015.07.31][0], node[rmSQDHKGQ6-VI6sEj4ERNA], [P], s[INITIALIZING], unassigned_info[[reason=CLUSTER_RECOVERED], at[2015-11-04T15:07:14.255Z]], indexUUID [B1aI44bgQWCsNDHJEPknDw], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[testindex-2015.07.31][0] failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[testindex-2015.07.31][0] shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock, _90.cfs, _43.cfe, _90.cfe, _5b.cfs, _2q.fdx, _74_Lucene41_0.doc, _h67.si, _5b.c... ...
Nov 4 10:07:45 localhost [WARN ][indices.cluster ] [elasticsearch1] [[testindex-2015.08.01][0]] marking and sending shard failed due to [failed recovery]

The last line in the log from this start is:
Nov 4 10:07:48 localhost [WARN ][cluster.action.shard ] [elasticsearch1] [kibana-int][2] received shard failed for [kibana-int][2], node[rmSQDHKGQ6-VI6sEj4ERNA], [P], s[INITIALIZING], unassigned_info[[reason=CLUSTER_RECOVERED], at[2015-11-04T15:07:14.275Z]], indexUUID [w-i8HQtCSZ2e7wcHcMy5gw], reason [master [elasticsearch1][rmSQDHKGQ6-VI6sEj4ERNA][esmaster1][inet[/192.168.1.42:9300]]{master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]

Many thanks for any help that can be offered.

unknownunknown · November 25, 2015, 3:37pm

In my case, I think the problem was that the nodes in the cluster could not fully communicate with each other. I had restarted the box, and the firewall reverted to its default, secure state, which blocked traffic between the nodes. I discovered this only after I tried to delete the indices and that failed as well. Unfortunately, I lost most of my indices as a result. The error messages could be clearer in this circumstance.
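For anyone hitting the same symptoms, a quick connectivity check between the nodes may save some indices. This is a minimal sketch, assuming the default transport port 9300 (the port shown in the log line above) and the master address 192.168.1.42 from that same line. The helper name is made up for illustration. A successful TCP connect only proves the port is reachable in one direction, so it is worth running from each node toward the other.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address and port taken from the log excerpt in this thread;
    # substitute the other node's address when running from the master.
    host, port = "192.168.1.42", 9300
    state = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```

If the transport port is blocked while the HTTP port (9200 by default) is open, the cluster can look partly alive from a client while the nodes themselves cannot coordinate.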
