Recovering From Corrupted Shard Following Upgrade to 1.3.1

A few days after upgrading to 1.3.1, we experienced our first corrupted shard in a two-node cluster:

[2014-08-06 15:54:28,815][WARN ][indices.cluster ] [FiveAces.Coffee.Web_IN_0] [streamentry5][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [streamentry5][4] failed to fetch index version after copying it over
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.lucene.index.CorruptIndexException: [streamentry5][4] Corrupted index [corrupted_fuDt8NuqR_egGJK0fcjl6g] caused by: CorruptIndexException[Invalid fieldsStream maxPointer (file truncated?): maxPointer=6833538, length=524288]
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:343)
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:328)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:119)
    ... 4 more

How do we recover from this?
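
One thing I can try on the affected node is running Lucene's CheckIndex directly against the shard to see how bad the damage is. A rough sketch (1.3.1 ships Lucene 4.9; the data path, cluster name, and jar location below are placeholders, not our real layout):

# shard copy on the affected node (adjust data path, cluster name, node ordinal)
SHARD=/var/lib/elasticsearch/mycluster/nodes/0/indices/streamentry5/4/index

# read-only inspection of every segment in that shard copy
java -cp /usr/share/elasticsearch/lib/lucene-core-4.9.0.jar \
  org.apache.lucene.index.CheckIndex "$SHARD"

CheckIndex also has a -fix option, but that works by dropping the broken segments and their documents, so I'd treat it as a last resort rather than a recovery.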

We've tried explicitly assigning via the reroute API:

{ "commands" : [ { "allocate" : { "index" : "streamentry5", "shard" : 4 ,
"node" : "FiveAces.Coffee.Web_IN_0", "allow_primary" : 1 }}]}

This puts the shard into INITIALIZING, but it quickly reverts to UNASSIGNED with a similar error in the logs.
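
For completeness, the body above was POSTed to the cluster reroute endpoint, roughly like this (host and port are illustrative, pointing at either node):

curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
  "commands" : [ {
    "allocate" : {
      "index" : "streamentry5",
      "shard" : 4,
      "node" : "FiveAces.Coffee.Web_IN_0",
      "allow_primary" : 1
    }
  } ]
}'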

I'm interested in theories on how this could have happened, given that we made no significant changes on our end during this period and have never seen this with ES before. More importantly, though, how do we recover from it?

Thank you

I should also mention that there is a primary copy of shard 4 on the other node; I just need to understand why it isn't recovering automatically here, and what I can do to manually remove the corrupted shard so that the primary gets replicated to this node.
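
Concretely, the manual route I have in mind is roughly the following. This assumes the copy on FiveAces.Coffee.Web_IN_0 is only a replica and the primary on the other node is healthy; the host, data path, cluster name, and node ordinal are placeholders:

# confirm where each copy of shard 4 lives and what state it is in
curl 'http://localhost:9200/_cat/shards/streamentry5?v'

# with FiveAces.Coffee.Web_IN_0 shut down, remove its corrupted copy so the
# shard is rebuilt from the healthy primary when the node rejoins the cluster
rm -rf /var/lib/elasticsearch/mycluster/nodes/0/indices/streamentry5/4

Is that the right approach, or is there a more supported way to drop just the bad copy?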
